* Mainline kernel OLTP performance update
@ 2009-01-13 21:10 Ma, Chinang
  2009-01-13 22:44 ` Wilcox, Matthew R
  0 siblings, 1 reply; 93+ messages in thread

From: Ma, Chinang @ 2009-01-13 21:10 UTC (permalink / raw)
To: linux-kernel
Cc: Tripathi, Sharad C, arjan, Wilcox, Matthew R, Kleen, Andi,
    Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W,
    Wang, Peter Xihong, Nueckel, Hubert, Chris Mason

This is the latest 2.6.29-rc1 kernel OLTP performance result. Compared to
2.6.24.2, the regression is around 3.5%.

Linux OLTP Performance summary
Kernel#      Speedup(x)  Intr/s  CtxSw/s  us%  sys%  idle%  iowait%
2.6.24.2     1.000       21969   43425    76   24    0      0
2.6.27.2     0.973       30402   43523    74   25    0      1
2.6.29-rc1   0.965       30331   41970    74   26    0      0

Server configurations:
Intel Xeon Quad-core 2.0GHz 2 cpus/8 cores/8 threads
64GB memory, 3 qle2462 FC HBA, 450 spindles (30 logical units)

======oprofile CPU_CLK_UNHALTED for top 30 functions
Cycles% 2.6.24.2                     Cycles% 2.6.27.2
1.0500  qla24xx_start_scsi           1.2125  qla24xx_start_scsi
0.8089  schedule                     0.6962  kmem_cache_alloc
0.5864  kmem_cache_alloc             0.6209  qla24xx_intr_handler
0.4989  __blockdev_direct_IO         0.4895  copy_user_generic_string
0.4152  copy_user_generic_string     0.4591  __blockdev_direct_IO
0.3953  qla24xx_intr_handler         0.4409  __end_that_request_first
0.3596  scsi_request_fn              0.3729  __switch_to
0.3188  __switch_to                  0.3716  try_to_wake_up
0.2889  lock_timer_base              0.3531  lock_timer_base
0.2519  task_rq_lock                 0.3393  scsi_request_fn
0.2474  aio_complete                 0.3038  aio_complete
0.2460  scsi_alloc_sgtable           0.2989  memset_c
0.2445  generic_make_request         0.2633  qla2x00_process_completed_re
0.2263  qla2x00_process_completed_re 0.2583  pick_next_highest_task_rt
0.2118  blk_queue_end_tag            0.2578  generic_make_request
0.2085  dio_bio_complete             0.2510  __list_add
0.2021  e1000_xmit_frame             0.2459  task_rq_lock
0.2006  __end_that_request_first     0.2322  kmem_cache_free
0.1954  generic_file_aio_read        0.2206  blk_queue_end_tag
0.1949  kfree                        0.2205  __mod_timer
0.1915  tcp_sendmsg                  0.2179  update_curr_rt
0.1901  try_to_wake_up               0.2164  sd_prep_fn
0.1895  kref_get                     0.2130  kref_get
0.1864  __mod_timer                  0.2075  dio_bio_complete
0.1863  thread_return                0.2066  push_rt_task
0.1854  math_state_restore           0.1974  qla24xx_msix_default
0.1775  __list_add                   0.1935  generic_file_aio_read
0.1721  memset_c                     0.1870  scsi_device_unbusy
0.1706  find_vma                     0.1861  tcp_sendmsg
0.1688  read_tsc                     0.1843  e1000_xmit_frame

======oprofile CPU_CLK_UNHALTED for top 30 functions
Cycles% 2.6.24.2                     Cycles% 2.6.29-rc1
1.0500  qla24xx_start_scsi           1.0691  qla24xx_intr_handler
0.8089  schedule                     0.7701  copy_user_generic_string
0.5864  kmem_cache_alloc             0.7339  qla24xx_wrt_req_reg
0.4989  __blockdev_direct_IO         0.6458  kmem_cache_alloc
0.4152  copy_user_generic_string     0.5794  qla24xx_start_scsi
0.3953  qla24xx_intr_handler         0.5505  unmap_vmas
0.3596  scsi_request_fn              0.4869  __blockdev_direct_IO
0.3188  __switch_to                  0.4493  try_to_wake_up
0.2889  lock_timer_base              0.4291  scsi_request_fn
0.2519  task_rq_lock                 0.4118  clear_page_c
0.2474  aio_complete                 0.4002  __switch_to
0.2460  scsi_alloc_sgtable           0.3381  ring_buffer_consume
0.2445  generic_make_request         0.3366  rb_get_reader_page
0.2263  qla2x00_process_completed_re 0.3222  aio_complete
0.2118  blk_queue_end_tag            0.3135  memset_c
0.2085  dio_bio_complete             0.2875  __list_add
0.2021  e1000_xmit_frame             0.2673  task_rq_lock
0.2006  __end_that_request_first     0.2658  __end_that_request_first
0.1954  generic_file_aio_read        0.2615  qla2x00_process_completed_re
0.1949  kfree                        0.2615  lock_timer_base
0.1915  tcp_sendmsg                  0.2456  disk_map_sector_rcu
0.1901  try_to_wake_up               0.2427  tcp_sendmsg
0.1895  kref_get                     0.2413  e1000_xmit_frame
0.1864  __mod_timer                  0.2398  kmem_cache_free
0.1863  thread_return                0.2384  pick_next_highest_task_rt
0.1854  math_state_restore           0.2225  blk_queue_end_tag
0.1775  __list_add                   0.2211  sd_prep_fn
0.1721  memset_c                     0.2167  qla24xx_queuecommand
0.1706  find_vma                     0.2109  scsi_device_unbusy
0.1688  read_tsc                     0.2095  kref_get

^ permalink raw reply	[flat|nested] 93+ messages in thread
* RE: Mainline kernel OLTP performance update
  2009-01-13 21:10 Mainline kernel OLTP performance update Ma, Chinang
@ 2009-01-13 22:44 ` Wilcox, Matthew R
  2009-01-15  0:35   ` Andrew Morton
  0 siblings, 1 reply; 93+ messages in thread

From: Wilcox, Matthew R @ 2009-01-13 22:44 UTC (permalink / raw)
To: Ma, Chinang, linux-kernel
Cc: Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B,
    Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong,
    Nueckel, Hubert, Chris Mason, Steven Rostedt

One encouraging thing is that we don't see a significant drop-off between
2.6.28 and 2.6.29-rc1, which I think is the first time we've not seen a big
problem with -rc1.

To compare the top 30 functions between 2.6.28 and 2.6.29-rc1:

1.4257  qla24xx_start_scsi           1.0691  qla24xx_intr_handler
0.8784  kmem_cache_alloc             0.7701  copy_user_generic_string
0.6876  qla24xx_intr_handler         0.7339  qla24xx_wrt_req_reg
0.5834  copy_user_generic_string     0.6458  kmem_cache_alloc
0.4945  scsi_request_fn              0.5794  qla24xx_start_scsi
0.4846  __blockdev_direct_IO         0.5505  unmap_vmas
0.4187  try_to_wake_up               0.4869  __blockdev_direct_IO
0.3518  aio_complete                 0.4493  try_to_wake_up
0.3513  __end_that_request_first     0.4291  scsi_request_fn
0.3483  __switch_to                  0.4118  clear_page_c
0.3271  memset_c                     0.4002  __switch_to
0.2976  qla2x00_process_completed_re 0.3381  ring_buffer_consume
0.2905  __list_add                   0.3366  rb_get_reader_page
0.2901  generic_make_request         0.3222  aio_complete
0.2755  lock_timer_base              0.3135  memset_c
0.2741  blk_queue_end_tag            0.2875  __list_add
0.2593  kmem_cache_free              0.2673  task_rq_lock
0.2445  disk_map_sector_rcu          0.2658  __end_that_request_first
0.2370  pick_next_highest_task_rt    0.2615  qla2x00_process_completed_re
0.2323  scsi_device_unbusy           0.2615  lock_timer_base
0.2321  task_rq_lock                 0.2456  disk_map_sector_rcu
0.2316  scsi_dispatch_cmd            0.2427  tcp_sendmsg
0.2239  kref_get                     0.2413  e1000_xmit_frame
0.2237  dio_bio_complete             0.2398  kmem_cache_free
0.2194  push_rt_task                 0.2384  pick_next_highest_task_rt
0.2145  __aio_get_req                0.2225  blk_queue_end_tag
0.2143  kfree                        0.2211  sd_prep_fn
0.2138  __mod_timer                  0.2167  qla24xx_queuecommand
0.2131  e1000_irq_enable             0.2109  scsi_device_unbusy
0.2091  scsi_softirq_done            0.2095  kref_get

It looks like a number of functions in the qla2x00 driver were split up, so
it's probably best to ignore all the changes in qla* functions.

unmap_vmas is a new hot function.  It's been around since before git history
started, and hasn't changed substantially between 2.6.28 and 2.6.29-rc1, so
I suspect we're calling it more often.  I don't know why we'd be doing that.

clear_page_c is also new to the hot list.  I haven't tried to understand why
this might be so.

The ring_buffer_consume() and rb_get_reader_page() functions are part of the
oprofile code.  This seems to indicate a bug -- they should not be the #12
and #13 hottest functions in the kernel when monitoring a database run!

That seems to be about it for regressions.

> -----Original Message-----
> From: Ma, Chinang
> Sent: Tuesday, January 13, 2009 1:11 PM
> To: linux-kernel@vger.kernel.org
> Cc: Tripathi, Sharad C; arjan@linux.intel.com; Wilcox, Matthew R; Kleen,
> Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter
> Xihong; Nueckel, Hubert; Chris Mason
> Subject: Mainline kernel OLTP performance update
>
> This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to
> 2.6.24.2 the regression is around 3.5%.
>
> Linux OLTP Performance summary
> Kernel#      Speedup(x)  Intr/s  CtxSw/s  us%  sys%  idle%  iowait%
> 2.6.24.2     1.000       21969   43425    76   24    0      0
> 2.6.27.2     0.973       30402   43523    74   25    0      1
> 2.6.29-rc1   0.965       30331   41970    74   26    0      0
>
> Server configurations:
> Intel Xeon Quad-core 2.0GHz 2 cpus/8 cores/8 threads
> 64GB memory, 3 qle2462 FC HBA, 450 spindles (30 logical units)
>
> ======oprofile CPU_CLK_UNHALTED for top 30 functions
> Cycles% 2.6.24.2                     Cycles% 2.6.27.2
> 1.0500  qla24xx_start_scsi           1.2125  qla24xx_start_scsi
> 0.8089  schedule                     0.6962  kmem_cache_alloc
> 0.5864  kmem_cache_alloc             0.6209  qla24xx_intr_handler
> 0.4989  __blockdev_direct_IO         0.4895  copy_user_generic_string
> 0.4152  copy_user_generic_string     0.4591  __blockdev_direct_IO
> 0.3953  qla24xx_intr_handler         0.4409  __end_that_request_first
> 0.3596  scsi_request_fn              0.3729  __switch_to
> 0.3188  __switch_to                  0.3716  try_to_wake_up
> 0.2889  lock_timer_base              0.3531  lock_timer_base
> 0.2519  task_rq_lock                 0.3393  scsi_request_fn
> 0.2474  aio_complete                 0.3038  aio_complete
> 0.2460  scsi_alloc_sgtable           0.2989  memset_c
> 0.2445  generic_make_request         0.2633  qla2x00_process_completed_re
> 0.2263  qla2x00_process_completed_re 0.2583  pick_next_highest_task_rt
> 0.2118  blk_queue_end_tag            0.2578  generic_make_request
> 0.2085  dio_bio_complete             0.2510  __list_add
> 0.2021  e1000_xmit_frame             0.2459  task_rq_lock
> 0.2006  __end_that_request_first     0.2322  kmem_cache_free
> 0.1954  generic_file_aio_read        0.2206  blk_queue_end_tag
> 0.1949  kfree                        0.2205  __mod_timer
> 0.1915  tcp_sendmsg                  0.2179  update_curr_rt
> 0.1901  try_to_wake_up               0.2164  sd_prep_fn
> 0.1895  kref_get                     0.2130  kref_get
> 0.1864  __mod_timer                  0.2075  dio_bio_complete
> 0.1863  thread_return                0.2066  push_rt_task
> 0.1854  math_state_restore           0.1974  qla24xx_msix_default
> 0.1775  __list_add                   0.1935  generic_file_aio_read
> 0.1721  memset_c                     0.1870  scsi_device_unbusy
> 0.1706  find_vma                     0.1861  tcp_sendmsg
> 0.1688  read_tsc                     0.1843  e1000_xmit_frame
>
> ======oprofile CPU_CLK_UNHALTED for top 30 functions
> Cycles% 2.6.24.2                     Cycles% 2.6.29-rc1
> 1.0500  qla24xx_start_scsi           1.0691  qla24xx_intr_handler
> 0.8089  schedule                     0.7701  copy_user_generic_string
> 0.5864  kmem_cache_alloc             0.7339  qla24xx_wrt_req_reg
> 0.4989  __blockdev_direct_IO         0.6458  kmem_cache_alloc
> 0.4152  copy_user_generic_string     0.5794  qla24xx_start_scsi
> 0.3953  qla24xx_intr_handler         0.5505  unmap_vmas
> 0.3596  scsi_request_fn              0.4869  __blockdev_direct_IO
> 0.3188  __switch_to                  0.4493  try_to_wake_up
> 0.2889  lock_timer_base              0.4291  scsi_request_fn
> 0.2519  task_rq_lock                 0.4118  clear_page_c
> 0.2474  aio_complete                 0.4002  __switch_to
> 0.2460  scsi_alloc_sgtable           0.3381  ring_buffer_consume
> 0.2445  generic_make_request         0.3366  rb_get_reader_page
> 0.2263  qla2x00_process_completed_re 0.3222  aio_complete
> 0.2118  blk_queue_end_tag            0.3135  memset_c
> 0.2085  dio_bio_complete             0.2875  __list_add
> 0.2021  e1000_xmit_frame             0.2673  task_rq_lock
> 0.2006  __end_that_request_first     0.2658  __end_that_request_first
> 0.1954  generic_file_aio_read        0.2615  qla2x00_process_completed_re
> 0.1949  kfree                        0.2615  lock_timer_base
> 0.1915  tcp_sendmsg                  0.2456  disk_map_sector_rcu
> 0.1901  try_to_wake_up               0.2427  tcp_sendmsg
> 0.1895  kref_get                     0.2413  e1000_xmit_frame
> 0.1864  __mod_timer                  0.2398  kmem_cache_free
> 0.1863  thread_return                0.2384  pick_next_highest_task_rt
> 0.1854  math_state_restore           0.2225  blk_queue_end_tag
> 0.1775  __list_add                   0.2211  sd_prep_fn
> 0.1721  memset_c                     0.2167  qla24xx_queuecommand
> 0.1706  find_vma                     0.2109  scsi_device_unbusy
> 0.1688  read_tsc                     0.2095  kref_get

^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-13 22:44 ` Wilcox, Matthew R @ 2009-01-15 0:35 ` Andrew Morton 2009-01-15 1:21 ` Matthew Wilcox 0 siblings, 1 reply; 93+ messages in thread From: Andrew Morton @ 2009-01-15 0:35 UTC (permalink / raw) To: Wilcox, Matthew R Cc: chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty On Tue, 13 Jan 2009 15:44:17 -0700 "Wilcox, Matthew R" <matthew.r.wilcox@intel.com> wrote: > (top-posting repaired. That @intel.com address is a bad influence ;)) (cc linux-scsi) > > -----Original Message----- > > From: Ma, Chinang > > Sent: Tuesday, January 13, 2009 1:11 PM > > To: linux-kernel@vger.kernel.org > > Cc: Tripathi, Sharad C; arjan@linux.intel.com; Wilcox, Matthew R; Kleen, > > Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter > > Xihong; Nueckel, Hubert; Chris Mason > > Subject: Mainline kernel OLTP performance update > > > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to > > 2.6.24.2 the regression is around 3.5%. > > > > Linux OLTP Performance summary > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait% > > 2.6.24.2 1.000 21969 43425 76 24 0 0 > > 2.6.27.2 0.973 30402 43523 74 25 0 1 > > 2.6.29-rc1 0.965 30331 41970 74 26 0 0 > > > > Server configurations: > > Intel Xeon Quad-core 2.0GHz 2 cpus/8 cores/8 threads > > 64GB memory, 3 qle2462 FC HBA, 450 spindles (30 logical units) > > > One encouraging thing is that we don't see a significant drop-off between 2.6.28 and 2.6.29-rc1, which I think is the first time we've not seen a big problem with -rc1. 
> > To compare the top 30 functions between 2.6.28 and 2.6.29-rc1: > > 1.4257 qla24xx_start_scsi 1.0691 qla24xx_intr_handler > 0.8784 kmem_cache_alloc 0.7701 copy_user_generic_string > 0.6876 qla24xx_intr_handler 0.7339 qla24xx_wrt_req_reg > 0.5834 copy_user_generic_string 0.6458 kmem_cache_alloc > 0.4945 scsi_request_fn 0.5794 qla24xx_start_scsi > 0.4846 __blockdev_direct_IO 0.5505 unmap_vmas > 0.4187 try_to_wake_up 0.4869 __blockdev_direct_IO > 0.3518 aio_complete 0.4493 try_to_wake_up > 0.3513 __end_that_request_first 0.4291 scsi_request_fn > 0.3483 __switch_to 0.4118 clear_page_c > 0.3271 memset_c 0.4002 __switch_to > 0.2976 qla2x00_process_completed_re 0.3381 ring_buffer_consume > 0.2905 __list_add 0.3366 rb_get_reader_page > 0.2901 generic_make_request 0.3222 aio_complete > 0.2755 lock_timer_base 0.3135 memset_c > 0.2741 blk_queue_end_tag 0.2875 __list_add > 0.2593 kmem_cache_free 0.2673 task_rq_lock > 0.2445 disk_map_sector_rcu 0.2658 __end_that_request_first > 0.2370 pick_next_highest_task_rt 0.2615 qla2x00_process_completed_re > 0.2323 scsi_device_unbusy 0.2615 lock_timer_base > 0.2321 task_rq_lock 0.2456 disk_map_sector_rcu > 0.2316 scsi_dispatch_cmd 0.2427 tcp_sendmsg > 0.2239 kref_get 0.2413 e1000_xmit_frame > 0.2237 dio_bio_complete 0.2398 kmem_cache_free > 0.2194 push_rt_task 0.2384 pick_next_highest_task_rt > 0.2145 __aio_get_req 0.2225 blk_queue_end_tag > 0.2143 kfree 0.2211 sd_prep_fn > 0.2138 __mod_timer 0.2167 qla24xx_queuecommand > 0.2131 e1000_irq_enable 0.2109 scsi_device_unbusy > 0.2091 scsi_softirq_done 0.2095 kref_get > > It looks like a number of functions in the qla2x00 driver were split up, so it's probably best to ignore all the changes in qla* functions. > > unmap_vmas is a new hot function. It's been around since before git history started, and hasn't changed substantially between 2.6.28 and 2.6.29-rc1, so I suspect we're calling it more often. I don't know why we'd be doing that. > > clear_page_c is also new to the hot list. 
I haven't tried to understand why this might be so. > > The ring_buffer_consume() and rb_get_reader_page() functions are part of the oprofile code. This seems to indicate a bug -- they should not be the #12 and #13 hottest functions in the kernel when monitoring a database run! > > That seems to be about it for regressions. > But the interrupt rate went through the roof. A 3.5% slowdown in this workload is considered pretty serious, isn't it? ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-15 0:35 ` Andrew Morton @ 2009-01-15 1:21 ` Matthew Wilcox 2009-01-15 2:04 ` Andrew Morton 2009-01-15 16:48 ` Ma, Chinang 0 siblings, 2 replies; 93+ messages in thread From: Matthew Wilcox @ 2009-01-15 1:21 UTC (permalink / raw) To: Andrew Morton Cc: Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote: > On Tue, 13 Jan 2009 15:44:17 -0700 > "Wilcox, Matthew R" <matthew.r.wilcox@intel.com> wrote: > > > > (top-posting repaired. That @intel.com address is a bad influence ;)) Alas, that email address goes to an Outlook client. Not much to be done about that. > (cc linux-scsi) > > > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to > > > 2.6.24.2 the regression is around 3.5%. > > > > > > Linux OLTP Performance summary > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait% > > > 2.6.24.2 1.000 21969 43425 76 24 0 0 > > > 2.6.27.2 0.973 30402 43523 74 25 0 1 > > > 2.6.29-rc1 0.965 30331 41970 74 26 0 0 > But the interrupt rate went through the roof. Yes. I forget why that was; I'll have to dig through my archives for that. > A 3.5% slowdown in this workload is considered pretty serious, isn't it? Yes. Anything above 0.3% is statistically significant. 1% is a big deal. The fact that we've lost 3.5% in the last year doesn't make people happy. There's a few things we've identified that have a big effect: - Per-partition statistics. Putting in a sysctl to stop doing them gets some of that back, but not as much as taking them out (even when the sysctl'd variable is in a __read_mostly section). We tried a patch from Jens to speed up the search for a new partition, but it had no effect. - The RT scheduler changes. 
They're better for some RT tasks, but not the database benchmark workload. Chinang has posted about this before, but the thread didn't really go anywhere. http://marc.info/?t=122903815000001&r=1&w=2 SLUB would have had a huge negative effect if we were using it -- on the order of 7% iirc. SLQB is at least performance-neutral with SLAB. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-15 1:21 ` Matthew Wilcox @ 2009-01-15 2:04 ` Andrew Morton 2009-01-15 2:27 ` Steven Rostedt ` (3 more replies) 2009-01-15 16:48 ` Ma, Chinang 1 sibling, 4 replies; 93+ messages in thread From: Andrew Morton @ 2009-01-15 2:04 UTC (permalink / raw) To: Matthew Wilcox Cc: Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <matthew@wil.cx> wrote: > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote: > > On Tue, 13 Jan 2009 15:44:17 -0700 > > "Wilcox, Matthew R" <matthew.r.wilcox@intel.com> wrote: > > > > > > > (top-posting repaired. That @intel.com address is a bad influence ;)) > > Alas, that email address goes to an Outlook client. Not much to be done > about that. aspirin? > > (cc linux-scsi) > > > > > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to > > > > 2.6.24.2 the regression is around 3.5%. > > > > > > > > Linux OLTP Performance summary > > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait% > > > > 2.6.24.2 1.000 21969 43425 76 24 0 0 > > > > 2.6.27.2 0.973 30402 43523 74 25 0 1 > > > > 2.6.29-rc1 0.965 30331 41970 74 26 0 0 > > > But the interrupt rate went through the roof. > > Yes. I forget why that was; I'll have to dig through my archives for > that. Oh. I'd have thought that this alone could account for 3.5%. > > A 3.5% slowdown in this workload is considered pretty serious, isn't it? > > Yes. Anything above 0.3% is statistically significant. 1% is a big > deal. The fact that we've lost 3.5% in the last year doesn't make > people happy. There's a few things we've identified that have a big > effect: > > - Per-partition statistics. 
Putting in a sysctl to stop doing them gets > some of that back, but not as much as taking them out (even when > the sysctl'd variable is in a __read_mostly section). We tried a > patch from Jens to speed up the search for a new partition, but it > had no effect. I find this surprising. > - The RT scheduler changes. They're better for some RT tasks, but not > the database benchmark workload. Chinang has posted about > this before, but the thread didn't really go anywhere. > http://marc.info/?t=122903815000001&r=1&w=2 Well. It's more a case that it wasn't taken anywhere. I appear to have recently been informed that there have never been any CPU-scheduler-caused regressions. Please persist! > SLUB would have had a huge negative effect if we were using it -- on the > order of 7% iirc. SLQB is at least performance-neutral with SLAB. We really need to unblock that problem somehow. I assume that enterprise distros are shipping slab? ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-15 2:04 ` Andrew Morton @ 2009-01-15 2:27 ` Steven Rostedt 2009-01-15 7:11 ` Ma, Chinang 2009-01-15 2:39 ` Andi Kleen ` (2 subsequent siblings) 3 siblings, 1 reply; 93+ messages in thread From: Steven Rostedt @ 2009-01-15 2:27 UTC (permalink / raw) To: Andrew Morton Cc: Matthew Wilcox, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Gregory Haskins (added Ingo, Thomas, Peter and Gregory) On Wed, 2009-01-14 at 18:04 -0800, Andrew Morton wrote: > On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <matthew@wil.cx> wrote: > > > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote: > > > On Tue, 13 Jan 2009 15:44:17 -0700 > > > "Wilcox, Matthew R" <matthew.r.wilcox@intel.com> wrote: > > > > > > > > > > (top-posting repaired. That @intel.com address is a bad influence ;)) > > > > Alas, that email address goes to an Outlook client. Not much to be done > > about that. > > aspirin? > > > > (cc linux-scsi) > > > > > > > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to > > > > > 2.6.24.2 the regression is around 3.5%. > > > > > > > > > > Linux OLTP Performance summary > > > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait% > > > > > 2.6.24.2 1.000 21969 43425 76 24 0 0 > > > > > 2.6.27.2 0.973 30402 43523 74 25 0 1 > > > > > 2.6.29-rc1 0.965 30331 41970 74 26 0 0 > > > > > But the interrupt rate went through the roof. > > > > Yes. I forget why that was; I'll have to dig through my archives for > > that. > > Oh. I'd have thought that this alone could account for 3.5%. > > > > A 3.5% slowdown in this workload is considered pretty serious, isn't it? > > > > Yes. Anything above 0.3% is statistically significant. 1% is a big > > deal. 
The fact that we've lost 3.5% in the last year doesn't make > > people happy. There's a few things we've identified that have a big > > effect: > > > > - Per-partition statistics. Putting in a sysctl to stop doing them gets > > some of that back, but not as much as taking them out (even when > > the sysctl'd variable is in a __read_mostly section). We tried a > > patch from Jens to speed up the search for a new partition, but it > > had no effect. > > I find this surprising. > > > - The RT scheduler changes. They're better for some RT tasks, but not > > the database benchmark workload. Chinang has posted about > > this before, but the thread didn't really go anywhere. > > http://marc.info/?t=122903815000001&r=1&w=2 I read the whole thread before I found what you were talking about here: http://marc.info/?l=linux-kernel&m=122937424114658&w=2 With this comment: "When setting foreground and log writer to rt-prio, the log latency reduced to 4.8ms. \ Performance is about 1.5% higher than the CFS result. On a side note, we had been using rt-prio on all DBMS processes and log writer ( in \ higher priority) for the best OLTP performance. That has worked pretty well until \ 2.6.25 when the new rt scheduler introduced the pull/push task for lower scheduling \ latency for rt-task. That has negative impact on this workload, probably due to the \ more elaborated load calculation/balancing for hundred of foreground rt-prio \ processes. Also, there is that question of no production environment would run DBMS \ with rt-prio. That is why I am going back to explore CFS and see whether I can drop \ rt-prio for good." A couple of questions: 1) how does the latest rt scheduler compare? There has been a lot of improvements. 2) how many rt tasks? 3) what were the prios, producer compared to consumers, not actual numbers 4) have you tried pinning tasks? RT is more about determinism than performance. The old scheduler migrated rt tasks the same as other tasks. 
This helps with performance because it will keep several rt tasks on the same CPU and cache hot even when a rt task can migrate. This helps performance, but kills determinism (I was seeing 10 ms wake up times from the next-highest-prio task on a cpu, even when another CPU was available). If you pin a task to a cpu, then it skips over the push and pull logic and will help with performance too. -- Steve > > Well. It's more a case that it wasn't taken anywhere. I appear to > have recently been informed that there have never been any > CPU-scheduler-caused regressions. Please persist! > > > SLUB would have had a huge negative effect if we were using it -- on the > > order of 7% iirc. SLQB is at least performance-neutral with SLAB. > > We really need to unblock that problem somehow. I assume that > enterprise distros are shipping slab? > ^ permalink raw reply [flat|nested] 93+ messages in thread
* RE: Mainline kernel OLTP performance update 2009-01-15 2:27 ` Steven Rostedt @ 2009-01-15 7:11 ` Ma, Chinang 2009-01-19 18:04 ` Chris Mason 0 siblings, 1 reply; 93+ messages in thread From: Ma, Chinang @ 2009-01-15 7:11 UTC (permalink / raw) To: Steven Rostedt, Andrew Morton Cc: Matthew Wilcox, Wilcox, Matthew R, linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, chris.mason, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Gregory Haskins Trying to answer to some of the question below: -Chinang >-----Original Message----- >From: Steven Rostedt [mailto:srostedt@redhat.com] >Sent: Wednesday, January 14, 2009 6:27 PM >To: Andrew Morton >Cc: Matthew Wilcox; Wilcox, Matthew R; Ma, Chinang; linux- >kernel@vger.kernel.org; Tripathi, Sharad C; arjan@linux.intel.com; Kleen, >Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter >Xihong; Nueckel, Hubert; chris.mason@oracle.com; linux-scsi@vger.kernel.org; >Andrew Vasquez; Anirban Chakraborty; Ingo Molnar; Thomas Gleixner; Peter >Zijlstra; Gregory Haskins >Subject: Re: Mainline kernel OLTP performance update > >(added Ingo, Thomas, Peter and Gregory) > >On Wed, 2009-01-14 at 18:04 -0800, Andrew Morton wrote: >> On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <matthew@wil.cx> wrote: >> >> > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote: >> > > On Tue, 13 Jan 2009 15:44:17 -0700 >> > > "Wilcox, Matthew R" <matthew.r.wilcox@intel.com> wrote: >> > > > >> > > >> > > (top-posting repaired. That @intel.com address is a bad influence ;)) >> > >> > Alas, that email address goes to an Outlook client. Not much to be >done >> > about that. >> >> aspirin? >> >> > > (cc linux-scsi) >> > > >> > > > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare >to >> > > > > 2.6.24.2 the regression is around 3.5%. 
>> > > > > >> > > > > Linux OLTP Performance summary >> > > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% >iowait% >> > > > > 2.6.24.2 1.000 21969 43425 76 24 0 >0 >> > > > > 2.6.27.2 0.973 30402 43523 74 25 0 >1 >> > > > > 2.6.29-rc1 0.965 30331 41970 74 26 0 >0 >> > >> > > But the interrupt rate went through the roof. >> > >> > Yes. I forget why that was; I'll have to dig through my archives for >> > that. >> >> Oh. I'd have thought that this alone could account for 3.5%. >> >> > > A 3.5% slowdown in this workload is considered pretty serious, isn't >it? >> > >> > Yes. Anything above 0.3% is statistically significant. 1% is a big >> > deal. The fact that we've lost 3.5% in the last year doesn't make >> > people happy. There's a few things we've identified that have a big >> > effect: >> > >> > - Per-partition statistics. Putting in a sysctl to stop doing them >gets >> > some of that back, but not as much as taking them out (even when >> > the sysctl'd variable is in a __read_mostly section). We tried a >> > patch from Jens to speed up the search for a new partition, but it >> > had no effect. >> >> I find this surprising. >> >> > - The RT scheduler changes. They're better for some RT tasks, but not >> > the database benchmark workload. Chinang has posted about >> > this before, but the thread didn't really go anywhere. >> > http://marc.info/?t=122903815000001&r=1&w=2 > >I read the whole thread before I found what you were talking about here: > >http://marc.info/?l=linux-kernel&m=122937424114658&w=2 > >With this comment: > >"When setting foreground and log writer to rt-prio, the log latency reduced >to 4.8ms. \ >Performance is about 1.5% higher than the CFS result. >On a side note, we had been using rt-prio on all DBMS processes and log >writer ( in \ >higher priority) for the best OLTP performance. That has worked pretty well >until \ >2.6.25 when the new rt scheduler introduced the pull/push task for lower >scheduling \ >latency for rt-task. 
That has negative impact on this workload, probably
>due to the more elaborated load calculation/balancing for hundred of
>foreground rt-prio processes. Also, there is that question of no
>production environment would run DBMS with rt-prio. That is why I am
>going back to explore CFS and see whether I can drop rt-prio for good."
>
>A couple of questions:
>
>1) how does the latest rt scheduler compare? There has been a lot of
>improvements.

It is difficult for me to isolate the recent rt scheduler improvements, as
so many other changes were introduced to the kernel at the same time. A
more accurate comparison would revert just the rt scheduler to the
previous version and test the delta. I am not sure how to get that done.

>2) how many rt tasks?

Around 250 rt tasks.

>3) what were the prios, producer compared to consumers, not actual numbers

I suppose the single log writer is the main producer (rt-prio 49, the
highest rt-prio in this workload); it wakes up all foreground processes
when the log write is done. The 240 foreground processes are the consumers
(rt-prio 48). At any given time some of the 240 foreground processes are
waiting for the log writer to finish flushing out the log data.

>4) have you tried pinning tasks?
>

We did try pinning foreground rt-processes to CPUs. That recovered about
1% performance but introduced idle time on some CPUs. Without load
balancing, my solution is to pin more processes to the idle CPUs. I don't
think this is a practical solution to the idle-time problem, as the
process distribution needs to be adjusted again when upgrading to a
different server.

>RT is more about determinism than performance. The old scheduler
>migrated rt tasks the same as other tasks. This helps with performance
>because it will keep several rt tasks on the same CPU and cache hot even
>when a rt task can migrate.
This helps performance, but kills >determinism (I was seeing 10 ms wake up times from the next-highest-prio >task on a cpu, even when another CPU was available). > >If you pin a task to a cpu, then it skips over the push and pull logic >and will help with performance too. > >-- Steve > > > >> >> Well. It's more a case that it wasn't taken anywhere. I appear to >> have recently been informed that there have never been any >> CPU-scheduler-caused regressions. Please persist! >> >> > SLUB would have had a huge negative effect if we were using it -- on >the >> > order of 7% iirc. SLQB is at least performance-neutral with SLAB. >> >> We really need to unblock that problem somehow. I assume that >> enterprise distros are shipping slab? >> ^ permalink raw reply [flat|nested] 93+ messages in thread
* RE: Mainline kernel OLTP performance update
  2009-01-15  7:11 ` Ma, Chinang
@ 2009-01-19 18:04   ` Chris Mason
  2009-01-19 18:37     ` Steven Rostedt
  0 siblings, 1 reply; 93+ messages in thread
From: Chris Mason @ 2009-01-19 18:04 UTC (permalink / raw)
To: Ma, Chinang
Cc: Steven Rostedt, Andrew Morton, Matthew Wilcox, Wilcox, Matthew R, linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Gregory Haskins

On Thu, 2009-01-15 at 00:11 -0700, Ma, Chinang wrote:
> >> > > > > Linux OLTP Performance summary
> >> > > > > Kernel#     Speedup(x)  Intr/s  CtxSw/s  us%  sys%  idle%  iowait%
> >> > > > > 2.6.24.2    1.000       21969   43425    76   24    0      0
> >> > > > > 2.6.27.2    0.973       30402   43523    74   25    0      1
> >> > > > > 2.6.29-rc1  0.965       30331   41970    74   26    0      0
> >> > >
> >> > > But the interrupt rate went through the roof.
> >> >
> >> > Yes.  I forget why that was; I'll have to dig through my archives for
> >> > that.
> >>
> >> Oh.  I'd have thought that this alone could account for 3.5%.

A later email indicated the reschedule interrupt count doubled since
2.6.24, and so I poked around a bit at the causes of resched_task.

I think the -rt version of check_preempt_equal_prio has gotten much more
expensive since 2.6.24.

I'm sure these changes were made for good reasons, and this workload may
not be a good reason to change it back.  But, what does the patch below
do to performance on 2.6.29-rcX?

-chris

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 954e1a8..bbe3492 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -842,6 +842,7 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int sync
 		resched_task(rq->curr);
 		return;
 	}
+	return;
 
 #ifdef CONFIG_SMP
 	/*

^ permalink raw reply related	[flat|nested] 93+ messages in thread
* RE: Mainline kernel OLTP performance update
  2009-01-19 18:04 ` Chris Mason
@ 2009-01-19 18:37   ` Steven Rostedt
  2009-01-19 18:55     ` Chris Mason
  2009-01-19 23:40     ` Ingo Molnar
  0 siblings, 2 replies; 93+ messages in thread
From: Steven Rostedt @ 2009-01-19 18:37 UTC (permalink / raw)
To: Chris Mason
Cc: Ma, Chinang, Andrew Morton, Matthew Wilcox, Wilcox, Matthew R, linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Gregory Haskins, Rusty Russell

(added Rusty)

On Mon, 2009-01-19 at 13:04 -0500, Chris Mason wrote:
> On Thu, 2009-01-15 at 00:11 -0700, Ma, Chinang wrote:
> > >> > > > > Linux OLTP Performance summary
> > >> > > > > Kernel#     Speedup(x)  Intr/s  CtxSw/s  us%  sys%  idle%  iowait%
> > >> > > > > 2.6.24.2    1.000       21969   43425    76   24    0      0
> > >> > > > > 2.6.27.2    0.973       30402   43523    74   25    0      1
> > >> > > > > 2.6.29-rc1  0.965       30331   41970    74   26    0      0
> > >> > >
> > >> > > But the interrupt rate went through the roof.
> > >> >
> > >> > Yes.  I forget why that was; I'll have to dig through my archives for
> > >> > that.
> > >>
> > >> Oh.  I'd have thought that this alone could account for 3.5%.
>
> A later email indicated the reschedule interrupt count doubled since
> 2.6.24, and so I poked around a bit at the causes of resched_task.
>
> I think the -rt version of check_preempt_equal_prio has gotten much more
> expensive since 2.6.24.
>
> I'm sure these changes were made for good reasons, and this workload may
> not be a good reason to change it back.  But, what does the patch below
> do to performance on 2.6.29-rcX?
>
> -chris
>
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index 954e1a8..bbe3492 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -842,6 +842,7 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int sync
>  		resched_task(rq->curr);
>  		return;
>  	}
> +	return;
>
> #ifdef CONFIG_SMP
> 	/*

That should not cause much of a problem if the scheduling task is not
pinned to a CPU.  But!!!!!

A recent change makes it expensive:

commit 24600ce89a819a8f2fb4fd69fd777218a82ade20
Author: Rusty Russell <rusty@rustcorp.com.au>
Date:   Tue Nov 25 02:35:13 2008 +1030

    sched: convert check_preempt_equal_prio to cpumask_var_t.

    Impact: stack reduction for large NR_CPUS

which has:

 static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 {
-	cpumask_t mask;
+	cpumask_var_t mask;

 	if (rq->curr->rt.nr_cpus_allowed == 1)
 		return;

-	if (p->rt.nr_cpus_allowed != 1
-	    && cpupri_find(&rq->rd->cpupri, p, &mask))
+	if (!alloc_cpumask_var(&mask, GFP_ATOMIC))
 		return;

check_preempt_equal_prio is in a scheduling hot path!!!!!

WTF are we allocating there for?

-- Steve

^ permalink raw reply	[flat|nested] 93+ messages in thread
* RE: Mainline kernel OLTP performance update 2009-01-19 18:37 ` Steven Rostedt @ 2009-01-19 18:55 ` Chris Mason 2009-01-19 19:07 ` Steven Rostedt 2009-01-19 23:40 ` Ingo Molnar 1 sibling, 1 reply; 93+ messages in thread From: Chris Mason @ 2009-01-19 18:55 UTC (permalink / raw) To: Steven Rostedt Cc: Ma, Chinang, Andrew Morton, Matthew Wilcox, Wilcox, Matthew R, linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Gregory Haskins, Rusty Russell On Mon, 2009-01-19 at 13:37 -0500, Steven Rostedt wrote: > (added Rusty) > > On Mon, 2009-01-19 at 13:04 -0500, Chris Mason wrote: > > > > I think the -rt version of check_preempt_equal_prio has gotten much more > > expensive since 2.6.24. > > > > I'm sure these changes were made for good reasons, and this workload may > > not be a good reason to change it back. But, what does the patch below > > do to performance on 2.6.29-rcX? > > > > -chris > > > > diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c > > index 954e1a8..bbe3492 100644 > > --- a/kernel/sched_rt.c > > +++ b/kernel/sched_rt.c > > @@ -842,6 +842,7 @@ static void check_preempt_curr_rt(struct rq *rq, > > struct task_struct *p, int sync > > resched_task(rq->curr); > > return; > > } > > + return; > > > > #ifdef CONFIG_SMP > > /* > > That should not cause much of a problem if the scheduling task is not > pinned to an CPU. But!!!!! > > A recent change makes it expensive: > + if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) > return; > check_preempt_equal_prio is in a scheduling hot path!!!!! > > WTF are we allocating there for? I wasn't actually looking at the cost of the checks, even though they do look higher (if they are using CONFIG_CPUMASK_OFFSTACK anyway). The 2.6.24 code would trigger a rescheduling interrupt only when the prio of the inbound task was higher than the running task. 
This workload has a large number of equal-priority rt tasks that are not
bound to a single CPU, and so I think it should trigger more
preempts/reschedules with today's check_preempt_equal_prio().

-chris

^ permalink raw reply	[flat|nested] 93+ messages in thread
* RE: Mainline kernel OLTP performance update
  2009-01-19 18:55 ` Chris Mason
@ 2009-01-19 19:07   ` Steven Rostedt
  0 siblings, 0 replies; 93+ messages in thread
From: Steven Rostedt @ 2009-01-19 19:07 UTC (permalink / raw)
To: Chris Mason
Cc: Ma, Chinang, Andrew Morton, Matthew Wilcox, Wilcox, Matthew R, linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Gregory Haskins, Rusty Russell

On Mon, 2009-01-19 at 13:55 -0500, Chris Mason wrote:
> I wasn't actually looking at the cost of the checks, even though they do
> look higher (if they are using CONFIG_CPUMASK_OFFSTACK anyway).
>
> The 2.6.24 code would trigger a rescheduling interrupt only when the
> prio of the inbound task was higher than the running task.
>
> This workload has a large number of equal priority rt tasks that are not
> bound to a single CPU, and so I think it should trigger more
> preempts/reschedules with today's check_preempt_equal_prio().

Ah, yeah.  This is one of the things that makes RT more "responsive" at
the expense of performance.  An RT task wants to run ASAP, even if that
means a chance of more interrupts and higher cache misses.

The old way gave much better throughput in general, but I measured RT
tasks taking up to tens of milliseconds to get scheduled.  That is
unacceptable for an RT task.

-- Steve

^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-19 18:37 ` Steven Rostedt 2009-01-19 18:55 ` Chris Mason @ 2009-01-19 23:40 ` Ingo Molnar 1 sibling, 0 replies; 93+ messages in thread From: Ingo Molnar @ 2009-01-19 23:40 UTC (permalink / raw) To: Steven Rostedt, Mike Travis, Rusty Russell Cc: Chris Mason, Ma, Chinang, Andrew Morton, Matthew Wilcox, Wilcox, Matthew R, linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Gregory Haskins, Rusty Russell * Steven Rostedt <srostedt@redhat.com> wrote: > (added Rusty) > > On Mon, 2009-01-19 at 13:04 -0500, Chris Mason wrote: > > On Thu, 2009-01-15 at 00:11 -0700, Ma, Chinang wrote: > > > >> > > > > > > > >> > > > > Linux OLTP Performance summary > > > >> > > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% > > > >iowait% > > > >> > > > > 2.6.24.2 1.000 21969 43425 76 24 0 > > > >0 > > > >> > > > > 2.6.27.2 0.973 30402 43523 74 25 0 > > > >1 > > > >> > > > > 2.6.29-rc1 0.965 30331 41970 74 26 0 > > > >0 > > > >> > > > > >> > > But the interrupt rate went through the roof. > > > >> > > > > >> > Yes. I forget why that was; I'll have to dig through my archives for > > > >> > that. > > > >> > > > >> Oh. I'd have thought that this alone could account for 3.5%. > > > > A later email indicated the reschedule interrupt count doubled since > > 2.6.24, and so I poked around a bit at the causes of resched_task. > > > > I think the -rt version of check_preempt_equal_prio has gotten much more > > expensive since 2.6.24. > > > > I'm sure these changes were made for good reasons, and this workload may > > not be a good reason to change it back. But, what does the patch below > > do to performance on 2.6.29-rcX? 
> > > > -chris > > > > diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c > > index 954e1a8..bbe3492 100644 > > --- a/kernel/sched_rt.c > > +++ b/kernel/sched_rt.c > > @@ -842,6 +842,7 @@ static void check_preempt_curr_rt(struct rq *rq, > > struct task_struct *p, int sync > > resched_task(rq->curr); > > return; > > } > > + return; > > > > #ifdef CONFIG_SMP > > /* > > That should not cause much of a problem if the scheduling task is not > pinned to an CPU. But!!!!! > > A recent change makes it expensive: > > commit 24600ce89a819a8f2fb4fd69fd777218a82ade20 > Author: Rusty Russell <rusty@rustcorp.com.au> > Date: Tue Nov 25 02:35:13 2008 +1030 > > sched: convert check_preempt_equal_prio to cpumask_var_t. > > Impact: stack reduction for large NR_CPUS > > > > which has: > > static void check_preempt_equal_prio(struct rq *rq, struct task_struct > *p) > { > - cpumask_t mask; > + cpumask_var_t mask; > > if (rq->curr->rt.nr_cpus_allowed == 1) > return; > > - if (p->rt.nr_cpus_allowed != 1 > - && cpupri_find(&rq->rd->cpupri, p, &mask)) > + if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) > return; > > > > > check_preempt_equal_prio is in a scheduling hot path!!!!! > > WTF are we allocating there for? Agreed - this needs to be fixed. Since this runs under the runqueue lock we can have a temporary cpumask in the runqueue itself, not on the stack. Ingo ^ permalink raw reply [flat|nested] 93+ messages in thread
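Ingo's suggestion of keeping a temporary cpumask in the runqueue itself, where it is protected by rq->lock, might look roughly like the sketch below. This is an illustration only, not the fix that was actually merged; the field name `rt_scratch_mask` is hypothetical, and the mask would have to be allocated once at boot (e.g. during scheduler init), never in the hot path:

```c
/* Hypothetical sketch: struct rq gains a scratch mask allocated once
 * at boot, so check_preempt_equal_prio() never calls
 * alloc_cpumask_var(GFP_ATOMIC) under the runqueue lock. */
struct rq {
	/* ... existing fields ... */
	cpumask_var_t rt_scratch_mask;	/* assumed name, boot-allocated */
};

static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
{
	if (rq->curr->rt.nr_cpus_allowed == 1)
		return;

	/* rq->lock is held here, so no one else can be using this
	 * runqueue's scratch mask concurrently. */
	if (p->rt.nr_cpus_allowed != 1 &&
	    cpupri_find(&rq->rd->cpupri, p, rq->rt_scratch_mask))
		return;

	/* ... rest of the function unchanged, minus the free ... */
}
```

The point of the pattern is simply that storage whose lifetime is bounded by a lock can be preallocated alongside the lock's data structure instead of being allocated per call.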
* Re: Mainline kernel OLTP performance update
  2009-01-15  2:04 ` Andrew Morton
  2009-01-15  2:27   ` Steven Rostedt
@ 2009-01-15  2:39   ` Andi Kleen
  2009-01-15  2:47     ` Matthew Wilcox
  2009-01-15  7:24   ` Nick Piggin
  2009-01-15 14:12   ` James Bottomley
  3 siblings, 1 reply; 93+ messages in thread
From: Andi Kleen @ 2009-01-15 2:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Matthew Wilcox, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty

Andrew Morton <akpm@linux-foundation.org> writes:
>> some of that back, but not as much as taking them out (even when
>> the sysctl'd variable is in a __read_mostly section).  We tried a
>> patch from Jens to speed up the search for a new partition, but it
>> had no effect.
>
> I find this surprising.

The test system has thousands of disks/LUNs which it writes to all the
time, in addition to a workload which is a real cache pig.  So any
increase in the per-LUN overhead leads directly to a lot more cache
misses in the kernel, because it increases the working set there
significantly.

>> - The RT scheduler changes.  They're better for some RT tasks, but not
>>   the database benchmark workload.  Chinang has posted about
>>   this before, but the thread didn't really go anywhere.
>>   http://marc.info/?t=122903815000001&r=1&w=2
>
> Well.  It's more a case that it wasn't taken anywhere.  I appear to
> have recently been informed that there have never been any
> CPU-scheduler-caused regressions.  Please persist!

Just to clarify: the non-RT scheduler has never performed well on this
workload (and it seems to be getting slightly worse too), mostly because
of log-writer starvation.  RT at some point performed significantly
better, but then, as the RT behaviour was improved to be more fair on MP,
there were significant regressions when running under RT.
I wouldn't really advocate making RT less fair again; it would be better
to just fix the non-RT scheduler to perform reasonably.  Unfortunately
the thread above, which was supposed to do that, didn't go anywhere.

>> SLUB would have had a huge negative effect if we were using it -- on the
>> order of 7% iirc.  SLQB is at least performance-neutral with SLAB.
>
> We really need to unblock that problem somehow.  I assume that
> enterprise distros are shipping slab?

The released ones all do.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-15 2:39 ` Andi Kleen @ 2009-01-15 2:47 ` Matthew Wilcox 2009-01-15 3:36 ` Andi Kleen 2009-01-20 13:27 ` Jens Axboe 0 siblings, 2 replies; 93+ messages in thread From: Matthew Wilcox @ 2009-01-15 2:47 UTC (permalink / raw) To: Andi Kleen Cc: Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty On Thu, Jan 15, 2009 at 03:39:05AM +0100, Andi Kleen wrote: > Andrew Morton <akpm@linux-foundation.org> writes: > >> some of that back, but not as much as taking them out (even when > >> the sysctl'd variable is in a __read_mostly section). We tried a > >> patch from Jens to speed up the search for a new partition, but it > >> had no effect. > > > > I find this surprising. > > The test system has thousands of disks/LUNs which it writes to > all the time, in addition to a workload which is a real cache pig. > So any increase in the per LUN overhead directly leads to a lot > more cache misses in the kernel because it increases the working set > there sigificantly. This particular system has 450 spindles, but they're amalgamated into 30 logical volumes by the hardware or firmware. Linux sees 30 LUNs. Each one, though, has fifteen partitions on it, so that brings us back up to 450 partitions. This system, btw, is a scale model of the full system that would be used to get published results. If I remember correctly, a 1% performance regression on this system is likely to translate to a 2% regression on the full-scale system. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-15 2:47 ` Matthew Wilcox @ 2009-01-15 3:36 ` Andi Kleen 2009-01-20 13:27 ` Jens Axboe 1 sibling, 0 replies; 93+ messages in thread From: Andi Kleen @ 2009-01-15 3:36 UTC (permalink / raw) To: Matthew Wilcox Cc: Andi Kleen, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty > This particular system has 450 spindles, but they're amalgamated into > 30 logical volumes by the hardware or firmware. Linux sees 30 LUNs. > Each one, though, has fifteen partitions on it, so that brings us back > up to 450 partitions. Thanks for the correction. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-01-15  2:47 ` Matthew Wilcox
  2009-01-15  3:36   ` Andi Kleen
@ 2009-01-20 13:27   ` Jens Axboe
  [not found]     ` <588992150B702C48B3312184F1B810AD03A497632C@azsmsx501.amr.corp.intel.com>
  1 sibling, 1 reply; 93+ messages in thread
From: Jens Axboe @ 2009-01-20 13:27 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andi Kleen, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty

On Wed, Jan 14 2009, Matthew Wilcox wrote:
> On Thu, Jan 15, 2009 at 03:39:05AM +0100, Andi Kleen wrote:
> > Andrew Morton <akpm@linux-foundation.org> writes:
> > >> some of that back, but not as much as taking them out (even when
> > >> the sysctl'd variable is in a __read_mostly section).  We tried a
> > >> patch from Jens to speed up the search for a new partition, but it
> > >> had no effect.
> > >
> > > I find this surprising.
> >
> > The test system has thousands of disks/LUNs which it writes to
> > all the time, in addition to a workload which is a real cache pig.
> > So any increase in the per-LUN overhead directly leads to a lot
> > more cache misses in the kernel because it increases the working set
> > there significantly.
>
> This particular system has 450 spindles, but they're amalgamated into
> 30 logical volumes by the hardware or firmware.  Linux sees 30 LUNs.
> Each one, though, has fifteen partitions on it, so that brings us back
> up to 450 partitions.
>
> This system, btw, is a scale model of the full system that would be used
> to get published results.  If I remember correctly, a 1% performance
> regression on this system is likely to translate to a 2% regression on
> the full-scale system.

Matthew, let's see if we can get this a little closer to disappearing.
I don't see lookup problems in the current kernel with the one-hit
cache, but perhaps it's not getting enough hits in this bigger test
case, or perhaps it's simply the rcu locking and preempt disables that
build up enough to cause a slowdown.

First things first: can you get a run of 2.6.29-rc2 with this patch?  It
will enable you to turn off per-partition stats in sysfs.  I'd suggest
doing one run with a 2.6.29-rc2 kernel booted with this patch, and then
another run with part_stats set to 0 for every exposed spindle.  Then
post those profiles!

diff --git a/block/blk-core.c b/block/blk-core.c
index a824e49..6f693ae 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -600,7 +600,8 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 	q->prep_rq_fn		= NULL;
 	q->unplug_fn		= generic_unplug_device;
 	q->queue_flags		= (1 << QUEUE_FLAG_CLUSTER |
-				   1 << QUEUE_FLAG_STACKABLE);
+				   1 << QUEUE_FLAG_STACKABLE |
+				   1 << QUEUE_FLAG_PART_STAT);
 	q->queue_lock		= lock;
 
 	blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a29cb78..a6ec2e3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -158,6 +158,29 @@ static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
 	return queue_var_show(set != 0, page);
 }
 
+static ssize_t queue_part_stat_store(struct request_queue *q, const char *page,
+				     size_t count)
+{
+	unsigned long nm;
+	ssize_t ret = queue_var_store(&nm, page, count);
+
+	spin_lock_irq(q->queue_lock);
+	if (nm)
+		queue_flag_set(QUEUE_FLAG_PART_STAT, q);
+	else
+		queue_flag_clear(QUEUE_FLAG_PART_STAT, q);
+
+	spin_unlock_irq(q->queue_lock);
+	return ret;
+}
+
+static ssize_t queue_part_stat_show(struct request_queue *q, char *page)
+{
+	unsigned int set = test_bit(QUEUE_FLAG_PART_STAT, &q->queue_flags);
+
+	return queue_var_show(set != 0, page);
+}
+
 static ssize_t
 queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
 {
@@ -222,6 +245,12 @@ static struct queue_sysfs_entry queue_rq_affinity_entry = {
 	.store = queue_rq_affinity_store,
 };
 
+static struct queue_sysfs_entry queue_part_stat_entry = {
+	.attr = {.name = "part_stats", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_part_stat_show,
+	.store = queue_part_stat_store,
+};
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -231,6 +260,7 @@ static struct attribute *default_attrs[] = {
 	&queue_hw_sector_size_entry.attr,
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
+	&queue_part_stat_entry.attr,
 	NULL,
 };
diff --git a/block/genhd.c b/block/genhd.c
index 397960c..09cbac2 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -208,6 +208,9 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
 	struct hd_struct *part;
 	int i;
 
+	if (!blk_queue_part_stat(disk->queue))
+		goto part0;
+
 	ptbl = rcu_dereference(disk->part_tbl);
 	part = rcu_dereference(ptbl->last_lookup);
@@ -222,6 +225,7 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
 			return part;
 		}
 	}
+part0:
 	return &disk->part0;
 }
 EXPORT_SYMBOL_GPL(disk_map_sector_rcu);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 044467e..4d45842 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -449,6 +449,7 @@ struct request_queue
 #define QUEUE_FLAG_STACKABLE	13	/* supports request stacking */
 #define QUEUE_FLAG_NONROT	14	/* non-rotational device (SSD) */
 #define QUEUE_FLAG_VIRT		QUEUE_FLAG_NONROT /* paravirt device */
+#define QUEUE_FLAG_PART_STAT	15	/* per-partition stats enabled */
 
 static inline int queue_is_locked(struct request_queue *q)
 {
@@ -568,6 +569,8 @@ enum {
 #define blk_queue_flushing(q)	((q)->ordseq)
 #define blk_queue_stackable(q)	\
 	test_bit(QUEUE_FLAG_STACKABLE, &(q)->queue_flags)
+#define blk_queue_part_stat(q)	\
+	test_bit(QUEUE_FLAG_PART_STAT, &(q)->queue_flags)
 
 #define blk_fs_request(rq)	((rq)->cmd_type == REQ_TYPE_FS)
 #define blk_pc_request(rq)	((rq)->cmd_type == REQ_TYPE_BLOCK_PC)

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 93+ messages in thread
[parent not found: <588992150B702C48B3312184F1B810AD03A497632C@azsmsx501.amr.corp.intel.com>]
* Re: Mainline kernel OLTP performance update
  [not found] ` <588992150B702C48B3312184F1B810AD03A497632C@azsmsx501.amr.corp.intel.com>
@ 2009-01-22 11:29   ` Jens Axboe
  [not found]     ` <588992150B702C48B3312184F1B810AD03A4F59632@azsmsx501.amr.corp.intel.com>
  0 siblings, 1 reply; 93+ messages in thread
From: Jens Axboe @ 2009-01-22 11:29 UTC (permalink / raw)
To: Chilukuri, Harita
Cc: Matthew Wilcox, Andi Kleen, Andrew Morton, Wilcox, Matthew R, Ma, Chinang, linux-kernel, Tripathi, Sharad C, arjan, Siddha, Suresh B, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty

On Wed, Jan 21 2009, Chilukuri, Harita wrote:
> Jens, we work with Matthew on the OLTP workload and have tested the
> part_stats patch on 2.6.29-rc2.  Below are the details:
>
> Disabling the part_stats has a positive impact on the OLTP workload.
>
> Linux OLTP Performance summary
> Kernel#                        Speedup(x)  Intr/s  CtxSw/s  us%  sys%  idle%  iowait%
> 2.6.29-rc2-part_stats          1.000       30329   41716    74   26    0      0
> 2.6.29-rc2-disable-part_stats  1.006       30413   42582    74   25    0      0
>
> Server configurations:
> Intel Xeon Quad-core 2.0GHz 2 cpus/8 cores/8 threads
> 64GB memory, 3 qle2462 FC HBA, 450 spindles (30 logical units)
>
> ======oprofile CPU_CLK_UNHALTED for top 30 functions
> Cycles% 2.6.29-rc2-part_stats          Cycles% 2.6.29-rc2-disable-part_stats
> 0.9634 qla24xx_intr_handler            1.0372 qla24xx_intr_handler
> 0.9057 copy_user_generic_string        0.7461 qla24xx_wrt_req_reg
> 0.7583 unmap_vmas                      0.7130 kmem_cache_alloc
> 0.6280 qla24xx_wrt_req_reg             0.6876 copy_user_generic_string
> 0.6088 kmem_cache_alloc                0.5656 qla24xx_start_scsi
> 0.5468 clear_page_c                    0.4881 __blockdev_direct_IO
> 0.5191 qla24xx_start_scsi              0.4728 try_to_wake_up
> 0.4892 try_to_wake_up                  0.4588 unmap_vmas
> 0.4870 __blockdev_direct_IO            0.4360 scsi_request_fn
> 0.4187 scsi_request_fn                 0.3711 __switch_to
> 0.3717 __switch_to                     0.3699 aio_complete
> 0.3567 rb_get_reader_page              0.3648 rb_get_reader_page
> 0.3396 aio_complete                    0.3597 ring_buffer_consume
> 0.3012 __end_that_request_first        0.3292 memset_c
> 0.2926 memset_c                        0.3076 __list_add
> 0.2926 ring_buffer_consume             0.2771 clear_page_c
> 0.2884 page_remove_rmap                0.2745 task_rq_lock
> 0.2691 disk_map_sector_rcu             0.2733 generic_make_request
> 0.2670 copy_page_c                     0.2555 tcp_sendmsg
> 0.2670 lock_timer_base                 0.2529 qla2x00_process_completed_re
> 0.2606 qla2x00_process_completed_re    0.2440 e1000_xmit_frame
> 0.2521 task_rq_lock                    0.2390 lock_timer_base
> 0.2328 __list_add                      0.2364 qla24xx_queuecommand
> 0.2286 generic_make_request            0.2301 kmem_cache_free
> 0.2286 pick_next_highest_task_rt       0.2262 blk_queue_end_tag
> 0.2136 push_rt_task                    0.2262 kref_get
> 0.2115 blk_queue_end_tag               0.2250 push_rt_task
> 0.2115 kmem_cache_free                 0.2135 scsi_dispatch_cmd
> 0.2051 e1000_xmit_frame                0.2084 sd_prep_fn
> 0.2051 scsi_device_unbusy              0.2059 kfree

Alright, so that's 0.6%.  IIRC, 0.1% (or thereabouts) is significant with
this benchmark, correct?  To get a feel for the rest of the accounting
overhead, could you try with this patch that just disables the whole
thing?
diff --git a/block/blk-core.c b/block/blk-core.c
index a824e49..eec9126 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -64,6 +64,7 @@ static struct workqueue_struct *kblockd_workqueue;
 
 static void drive_stat_acct(struct request *rq, int new_io)
 {
+#if 0
 	struct hd_struct *part;
 	int rw = rq_data_dir(rq);
 	int cpu;
@@ -82,6 +83,7 @@ static void drive_stat_acct(struct request *rq, int new_io)
 	}
 
 	part_stat_unlock();
+#endif
 }
 
 void blk_queue_congestion_threshold(struct request_queue *q)
@@ -1014,6 +1017,7 @@ static inline void add_request(struct request_queue *q, struct request *req)
 	__elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
 }
 
+#if 0
 static void part_round_stats_single(int cpu, struct hd_struct *part,
 				    unsigned long now)
 {
@@ -1027,6 +1031,7 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
 	}
 	part->stamp = now;
 }
+#endif
 
 /**
  * part_round_stats() - Round off the performance stats on a struct disk_stats.
@@ -1046,11 +1051,13 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
  */
 void part_round_stats(int cpu, struct hd_struct *part)
 {
+#if 0
 	unsigned long now = jiffies;
 
 	if (part->partno)
 		part_round_stats_single(cpu, &part_to_disk(part)->part0, now);
 	part_round_stats_single(cpu, part, now);
+#endif
 }
 EXPORT_SYMBOL_GPL(part_round_stats);
 
@@ -1690,6 +1697,7 @@ static int __end_that_request_first(struct request *req, int error,
 				(unsigned long long)req->sector);
 	}
 
+#if 0
 	if (blk_fs_request(req) && req->rq_disk) {
 		const int rw = rq_data_dir(req);
 		struct hd_struct *part;
@@ -1700,6 +1708,7 @@ static int __end_that_request_first(struct request *req, int error,
 		part_stat_add(cpu, part, sectors[rw], nr_bytes >> 9);
 		part_stat_unlock();
 	}
+#endif
 
 	total_bytes = bio_nbytes = 0;
 	while ((bio = req->bio) != NULL) {
@@ -1779,7 +1788,9 @@ static int __end_that_request_first(struct request *req, int error,
  */
 static void end_that_request_last(struct request *req, int error)
 {
+#if 0
 	struct gendisk *disk = req->rq_disk;
+#endif
 
 	if (blk_rq_tagged(req))
 		blk_queue_end_tag(req->q, req);
@@ -1797,6 +1808,7 @@ static void end_that_request_last(struct request *req, int error)
 	 * IO on queueing nor completion.  Accounting the containing
 	 * request is enough.
 	 */
+#if 0
 	if (disk && blk_fs_request(req) && req != &req->q->bar_rq) {
 		unsigned long duration = jiffies - req->start_time;
 		const int rw = rq_data_dir(req);
@@ -1813,6 +1825,7 @@ static void end_that_request_last(struct request *req, int error)
 		part_stat_unlock();
 	}
+#endif
 
 	if (req->end_io)
 		req->end_io(req, error);

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 93+ messages in thread
[parent not found: <588992150B702C48B3312184F1B810AD03A4F59632@azsmsx501.amr.corp.intel.com>]
* Re: Mainline kernel OLTP performance update
  [not found] ` <588992150B702C48B3312184F1B810AD03A4F59632@azsmsx501.amr.corp.intel.com>
@ 2009-01-27  8:28   ` Jens Axboe
  0 siblings, 0 replies; 93+ messages in thread
From: Jens Axboe @ 2009-01-27 8:28 UTC (permalink / raw)
To: Chilukuri, Harita
Cc: Matthew Wilcox, Andi Kleen, Andrew Morton, Wilcox, Matthew R, Ma, Chinang, linux-kernel, Tripathi, Sharad C, arjan, Siddha, Suresh B, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty

On Mon, Jan 26 2009, Chilukuri, Harita wrote:
> Jens, we did test the patch that disables the whole stats.  We get a
> 0.5% gain with this patch on 2.6.29-rc2, compared to
> 2.6.29-rc2-disable_part_stats.
>
> Below is the description of the result:
>
> Linux OLTP Performance summary
> Kernel#                             Speedup(x)  Intr/s  CtxSw/s  us%  sys%  idle%  iowait%
> 2.6.29-rc2-disable_partition_stats  1.000       30413   42582    74   25    0      0
> 2.6.29-rc2-disable_all              1.005       30401   42656    74   25    0      0
>
> Server configurations:
> Intel Xeon Quad-core 2.0GHz 2 cpus/8 cores/8 threads
> 64GB memory, 3 qle2462 FC HBA, 450 spindles (30 logical units)

OK, so about the same, which means the lookup is likely the expensive
bit.  I have merged this patch:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=e5b74b703da41fab060adc335a0b98fa5a5ea61d

which exposes an 'iostats' toggle that allows users to disable disk
statistics completely.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 93+ messages in thread
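For reference, a toggle of this kind is exposed per queue in sysfs, so disabling accounting at run time should look something like the fragment below. The device name `sda` is a placeholder, and the write requires root, hence the guard:

```shell
# Disable disk statistics for one device via the queue's 'iostats' attribute.
# 'sda' is a placeholder device; skip silently where the file is absent
# or not writable (non-root, older kernels).
f=/sys/block/sda/queue/iostats
if [ -w "$f" ]; then
    cat "$f"         # 1 = accounting enabled (the default)
    echo 0 > "$f"    # turn off stats for this queue
fi
```

On the benchmark rig this would be repeated for each of the 30 exposed LUNs.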
* Re: Mainline kernel OLTP performance update 2009-01-15 2:04 ` Andrew Morton 2009-01-15 2:27 ` Steven Rostedt 2009-01-15 2:39 ` Andi Kleen @ 2009-01-15 7:24 ` Nick Piggin 2009-01-15 9:46 ` Pekka Enberg 2009-01-16 0:27 ` Andrew Morton 2009-01-15 14:12 ` James Bottomley 3 siblings, 2 replies; 93+ messages in thread From: Nick Piggin @ 2009-01-15 7:24 UTC (permalink / raw) To: Andrew Morton Cc: Matthew Wilcox, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty On Thursday 15 January 2009 13:04:31 Andrew Morton wrote: > On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <matthew@wil.cx> wrote: > > SLUB would have had a huge negative effect if we were using it -- on the > > order of 7% iirc. SLQB is at least performance-neutral with SLAB. > > We really need to unblock that problem somehow. I assume that > enterprise distros are shipping slab? SLES11 will ship with SLAB, FWIW. As I said in the SLQB thread, this was not due to my input. But I think it was probably the right choice to make in that situation. The biggest problem with SLAB for SGI I think is alien caches bloating the kmem cache footprint to many GB each on their huge systems, but SLAB has a parameter to turn off alien caches anyway so I think that is a reasonable workaround. Given the OLTP regression, and also I'd hate to have to deal with even more reports of people's order-N allocations failing... basically with the regression potential there, I don't think there was a compelling case found to use SLUB (ie. where does it actually help?). I'm going to propose to try to unblock the problem by asking to merge SLQB with a plan to end up picking just one general allocator (and SLOB). 
Given that SLAB and SLUB are fairly mature, I wonder what you'd think of taking SLQB into -mm and making it the default there for a while, to see if anybody reports a problem? ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-15 7:24 ` Nick Piggin @ 2009-01-15 9:46 ` Pekka Enberg 2009-01-15 13:52 ` Matthew Wilcox 2009-01-16 0:27 ` Andrew Morton 1 sibling, 1 reply; 93+ messages in thread From: Pekka Enberg @ 2009-01-15 9:46 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Matthew Wilcox, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty On Thu, Jan 15, 2009 at 9:24 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > SLES11 will ship with SLAB, FWIW. As I said in the SLQB thread, this was > not due to my input. But I think it was probably the right choice to make > in that situation. > > The biggest problem with SLAB for SGI I think is alien caches bloating the > kmem cache footprint to many GB each on their huge systems, but SLAB has a > parameter to turn off alien caches anyway so I think that is a reasonable > workaround. > > Given the OLTP regression, and also I'd hate to have to deal with even > more reports of people's order-N allocations failing... basically with the > regression potential there, I don't think there was a compelling case > found to use SLUB (ie. where does it actually help?). > > I'm going to propose to try to unblock the problem by asking to merge SLQB > with a plan to end up picking just one general allocator (and SLOB). It would also be nice if someone could do the performance analysis on the SLUB bug. I ran sysbench in oltp mode here and the results look like this: [ number of transactions per second from 10 runs. 
] min max avg sd 2.6.29-rc1-slab 833.77 852.32 845.10 4.72 2.6.29-rc1-slub 823.61 851.94 836.74 8.57 I used the following sysbench parameters: sysbench --test=oltp \ --oltp-table-size=1000000 \ --mysql-socket=/var/run/mysqld/mysqld.sock \ prepare sysbench --num-threads=16 \ --max-requests=100000 \ --test=oltp --oltp-table-size=1000000 \ --mysql-socket=/var/run/mysqld/mysqld.sock \ --oltp-read-only run And no, the numbers are not flipped, SLUB beats SLAB here. :( Pekka $ mysql --version mysql Ver 14.12 Distrib 5.0.51a, for debian-linux-gnu (x86_64) using readline 5.2 $ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz stepping : 6 cpu MHz : 1000.000 cache size : 4096 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow bogomips : 3989.99 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz stepping : 6 cpu MHz : 1000.000 cache size : 4096 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow bogomips : 3990.04 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: $ 
lspci 00:00.0 Host bridge: Intel Corporation Mobile 945GM/PM/GMS, 943/940GML and 945GT Express Memory Controller Hub (rev 03) 00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03) 00:02.1 Display controller: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller (rev 03) 00:07.0 Performance counters: Intel Corporation Unknown device 27a3 (rev 03) 00:1b.0 Audio device: Intel Corporation 82801G (ICH7 Family) High Definition Audio Controller (rev 02) 00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 02) 00:1c.1 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 2 (rev 02) 00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #1 (rev 02) 00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #2 (rev 02) 00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #3 (rev 02) 00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #4 (rev 02) 00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 02) 00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev e2) 00:1f.0 ISA bridge: Intel Corporation 82801GBM (ICH7-M) LPC Interface Bridge (rev 02) 00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 02) 00:1f.2 IDE interface: Intel Corporation 82801GBM/GHM (ICH7 Family) SATA IDE Controller (rev 02) 00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 02) 01:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 22) 02:00.0 Network controller: Atheros Communications Inc. AR5418 802.11abgn Wireless PCI Express Adapter (rev 01) 03:03.0 FireWire (IEEE 1394): Agere Systems FW323 (rev 61) ^ permalink raw reply [flat|nested] 93+ messages in thread
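[The thread never shows how the min/max/avg/sd rows in these tables were reduced from the ten runs; a minimal sketch, assuming "sd" means the sample standard deviation and using made-up per-run numbers:]

```python
# Sketch of reducing per-run sysbench transactions/sec figures to the
# min/max/avg/sd columns used in the tables above. The run data below
# is hypothetical, for illustration only.
import statistics


def summarize(tps_runs):
    """Reduce a list of per-run TPS values to summary statistics
    (sd = sample standard deviation, n-1 in the denominator)."""
    return {
        "min": min(tps_runs),
        "max": max(tps_runs),
        "avg": statistics.mean(tps_runs),
        "sd": statistics.stdev(tps_runs),
    }


if __name__ == "__main__":
    runs = [802.02, 803.11, 803.90, 804.20, 805.37]  # hypothetical
    s = summarize(runs)
    print(f"{s['min']:.2f} {s['max']:.2f} {s['avg']:.2f} {s['sd']:.2f}")
```

Comparing the sd column against the slab/slub delta, as Matthew does below, is what tells you whether a difference is signal or noise.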
* Re: Mainline kernel OLTP performance update 2009-01-15 9:46 ` Pekka Enberg @ 2009-01-15 13:52 ` Matthew Wilcox 2009-01-15 14:42 ` Pekka Enberg 2009-01-16 10:16 ` Pekka Enberg 0 siblings, 2 replies; 93+ messages in thread From: Matthew Wilcox @ 2009-01-15 13:52 UTC (permalink / raw) To: Pekka Enberg Cc: Nick Piggin, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty On Thu, Jan 15, 2009 at 11:46:09AM +0200, Pekka Enberg wrote: > It would also be nice if someone could do the performance analysis on > the SLUB bug. I ran sysbench in oltp mode here and the results look > like this: > > [ number of transactions per second from 10 runs. ] > > min max avg sd > 2.6.29-rc1-slab 833.77 852.32 845.10 4.72 > 2.6.29-rc1-slub 823.61 851.94 836.74 8.57 > > And no, the numbers are not flipped, SLUB beats SLAB here. :( Um. More transactions per second is good. Your numbers show SLAB beating SLUB (even on your dual-CPU system). And SLAB shows a lower standard deviation, which is also good. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-15 13:52 ` Matthew Wilcox @ 2009-01-15 14:42 ` Pekka Enberg 2009-01-16 10:16 ` Pekka Enberg 1 sibling, 0 replies; 93+ messages in thread From: Pekka Enberg @ 2009-01-15 14:42 UTC (permalink / raw) To: Matthew Wilcox Cc: Nick Piggin, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty Matthew Wilcox wrote: > On Thu, Jan 15, 2009 at 11:46:09AM +0200, Pekka Enberg wrote: >> It would also be nice if someone could do the performance analysis on >> the SLUB bug. I ran sysbench in oltp mode here and the results look >> like this: >> >> [ number of transactions per second from 10 runs. ] >> >> min max avg sd >> 2.6.29-rc1-slab 833.77 852.32 845.10 4.72 >> 2.6.29-rc1-slub 823.61 851.94 836.74 8.57 >> >> And no, the numbers are not flipped, SLUB beats SLAB here. :( > > Um. More transactions per second is good. Your numbers show SLAB > beating SLUB (even on your dual-CPU system). And SLAB shows a lower > standard deviation, which is also good. *blush* Will do oprofile tomorrow. Thanks Matthew. ^ permalink raw reply [flat|nested] 93+ messages in thread
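[Since the follow-up promises an oprofile run, here is a rough sketch of collecting a CPU_CLK_UNHALTED profile like the tables earlier in the thread. The opcontrol/opreport commands are the standard oprofile CLI, but the vmlinux path and sample count are assumptions:]

```shell
# Rough sketch of an oprofile session for comparing the allocators.
# vmlinux path and the 100000-cycle sample interval are examples.
opcontrol --setup --vmlinux=/boot/vmlinux-2.6.29-rc1 \
          --event=CPU_CLK_UNHALTED:100000
opcontrol --reset
opcontrol --start
# ... run the sysbench OLTP workload here ...
opcontrol --stop
opcontrol --dump
opreport --symbols --threshold=0.1   # top kernel functions by Cycles%
```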
* Re: Mainline kernel OLTP performance update 2009-01-15 13:52 ` Matthew Wilcox 2009-01-15 14:42 ` Pekka Enberg @ 2009-01-16 10:16 ` Pekka Enberg 2009-01-16 10:21 ` Nick Piggin 1 sibling, 1 reply; 93+ messages in thread From: Pekka Enberg @ 2009-01-16 10:16 UTC (permalink / raw) To: Matthew Wilcox Cc: Nick Piggin, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Christoph Lameter On Thu, Jan 15, 2009 at 11:46:09AM +0200, Pekka Enberg wrote: >> It would also be nice if someone could do the performance analysis on >> the SLUB bug. I ran sysbench in oltp mode here and the results look >> like this: >> >> [ number of transactions per second from 10 runs. ] >> >> min max avg sd >> 2.6.29-rc1-slab 833.77 852.32 845.10 4.72 >> 2.6.29-rc1-slub 823.61 851.94 836.74 8.57 >> >> And no, the numbers are not flipped, SLUB beats SLAB here. :( On Thu, Jan 15, 2009 at 3:52 PM, Matthew Wilcox <matthew@wil.cx> wrote: > Um. More transactions per second is good. Your numbers show SLAB > beating SLUB (even on your dual-CPU system). And SLAB shows a lower > standard deviation, which is also good. I had lockdep enabled in my config so I ran the tests again with x86-64 defconfig and I'm back to square one: [ number of transactions per second from 10 runs, bigger is better ] min max avg sd 2.6.29-rc1-slab 802.02 805.37 803.93 0.97 2.6.29-rc1-slub 807.78 811.20 809.86 1.05 Pekka ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 10:16 ` Pekka Enberg @ 2009-01-16 10:21 ` Nick Piggin 2009-01-16 10:31 ` Pekka Enberg 0 siblings, 1 reply; 93+ messages in thread From: Nick Piggin @ 2009-01-16 10:21 UTC (permalink / raw) To: Pekka Enberg Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Christoph Lameter On Friday 16 January 2009 21:16:31 Pekka Enberg wrote: > On Thu, Jan 15, 2009 at 11:46:09AM +0200, Pekka Enberg wrote: > >> It would also be nice if someone could do the performance analysis on > >> the SLUB bug. I ran sysbench in oltp mode here and the results look > >> like this: > >> > >> [ number of transactions per second from 10 runs. ] > >> > >> min max avg sd > >> 2.6.29-rc1-slab 833.77 852.32 845.10 4.72 > >> 2.6.29-rc1-slub 823.61 851.94 836.74 8.57 > I had lockdep enabled in my config so I ran the tests again with > x86-64 defconfig and I'm back to square one: > > [ number of transactions per second from 10 runs, bigger is better ] > > min max avg sd > 2.6.29-rc1-slab 802.02 805.37 803.93 0.97 > 2.6.29-rc1-slub 807.78 811.20 809.86 1.05 Hm, I wonder why it is going slower with lockdep disabled? Did something else change? ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 10:21 ` Nick Piggin @ 2009-01-16 10:31 ` Pekka Enberg 2009-01-16 10:42 ` Nick Piggin 2009-01-16 20:59 ` Christoph Lameter 0 siblings, 2 replies; 93+ messages in thread From: Pekka Enberg @ 2009-01-16 10:31 UTC (permalink / raw) To: Nick Piggin Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Christoph Lameter On Friday 16 January 2009 21:16:31 Pekka Enberg wrote: >> I had lockdep enabled in my config so I ran the tests again with >> x86-64 defconfig and I'm back to square one: >> >> [ number of transactions per second from 10 runs, bigger is better ] >> >> min max avg sd >> 2.6.29-rc1-slab 802.02 805.37 803.93 0.97 >> 2.6.29-rc1-slub 807.78 811.20 809.86 1.05 On Fri, Jan 16, 2009 at 12:21 PM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Hm, I wonder why it is going slower with lockdep disabled? > Did something else change? I don't have the exact config for the previous tests, but it was just my regular laptop config whereas the new tests are x86-64 defconfig. So I think I'm just hitting some of the other OLTP regressions here, aren't I? There are some scheduler-related options such as CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig that I didn't have in the original tests. I can try without them if you want but I'm not sure it's relevant for SLAB vs SLUB tests. Pekka ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 10:31 ` Pekka Enberg @ 2009-01-16 10:42 ` Nick Piggin 2009-01-16 10:55 ` Pekka Enberg 2009-01-16 20:59 ` Christoph Lameter 1 sibling, 1 reply; 93+ messages in thread From: Nick Piggin @ 2009-01-16 10:42 UTC (permalink / raw) To: Pekka Enberg Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Christoph Lameter On Friday 16 January 2009 21:31:03 Pekka Enberg wrote: > On Friday 16 January 2009 21:16:31 Pekka Enberg wrote: > >> I had lockdep enabled in my config so I ran the tests again with > >> x86-64 defconfig and I'm back to square one: > >> > >> [ number of transactions per second from 10 runs, bigger is better ] > >> > >> min max avg sd > >> 2.6.29-rc1-slab 802.02 805.37 803.93 0.97 > >> 2.6.29-rc1-slub 807.78 811.20 809.86 1.05 > > On Fri, Jan 16, 2009 at 12:21 PM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > Hm, I wonder why it is going slower with lockdep disabled? > > Did something else change? > > I don't have the exact config for the previous tests but it's was just > my laptop regular config whereas the new tests are x86-64 defconfig. > So I think I'm just hitting some of the other OLTP regressions here, > aren't I? There's some scheduler related options such as > CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig > that I didn't have in the original tests. I can try without them if > you want but I'm not sure it's relevant for SLAB vs SLUB tests. Oh no that's fine. It just looked like you repeated the test but with lockdep disabled (and no other changes). ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 10:42 ` Nick Piggin @ 2009-01-16 10:55 ` Pekka Enberg 2009-01-19 7:13 ` Nick Piggin 0 siblings, 1 reply; 93+ messages in thread From: Pekka Enberg @ 2009-01-16 10:55 UTC (permalink / raw) To: Nick Piggin Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Christoph Lameter Hi Nick, On Fri, Jan 16, 2009 at 12:42 PM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: >> I don't have the exact config for the previous tests but it's was just >> my laptop regular config whereas the new tests are x86-64 defconfig. >> So I think I'm just hitting some of the other OLTP regressions here, >> aren't I? There's some scheduler related options such as >> CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig >> that I didn't have in the original tests. I can try without them if >> you want but I'm not sure it's relevant for SLAB vs SLUB tests. > > Oh no that's fine. It just looked like you repeated the test but > with lockdep disabled (and no other changes). Right. In any case, I am still unable to reproduce the OLTP issue and I've seen SLUB beat SLAB on my machine in most of the benchmarks you've posted. So I have very mixed feelings about SLQB. It's very nice that it works for OLTP but we still don't have much insight (i.e. numbers) on why it's better. I'm also bit worried if SLQB has gotten enough attention from the NUMA and HPC folks that brought us SLUB. The good news is that SLQB can replace SLAB so either way, we're not going to end up with four allocators. Whether it can replace SLUB remains to be seen. Pekka ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 10:55 ` Pekka Enberg @ 2009-01-19 7:13 ` Nick Piggin 2009-01-19 8:05 ` Pekka Enberg 0 siblings, 1 reply; 93+ messages in thread From: Nick Piggin @ 2009-01-19 7:13 UTC (permalink / raw) To: Pekka Enberg Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Christoph Lameter On Friday 16 January 2009 21:55:30 Pekka Enberg wrote: > Hi Nick, > > On Fri, Jan 16, 2009 at 12:42 PM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> I don't have the exact config for the previous tests but it's was just > >> my laptop regular config whereas the new tests are x86-64 defconfig. > >> So I think I'm just hitting some of the other OLTP regressions here, > >> aren't I? There's some scheduler related options such as > >> CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig > >> that I didn't have in the original tests. I can try without them if > >> you want but I'm not sure it's relevant for SLAB vs SLUB tests. > > > > Oh no that's fine. It just looked like you repeated the test but > > with lockdep disabled (and no other changes). > > Right. In any case, I am still unable to reproduce the OLTP issue and > I've seen SLUB beat SLAB on my machine in most of the benchmarks > you've posted. SLUB was distinctly slower on the tbench, netperf, and hackbench tests that I ran. These were faster with SLUB on your machine? What kind of system is it? > So I have very mixed feelings about SLQB. It's very > nice that it works for OLTP but we still don't have much insight (i.e. > numbers) on why it's better. According to estimates in this thread, I think Matthew said SLUB would be around 6% slower? SLQB is within measurement error of SLAB. 
Fair point about personally reproducing the OLTP problem yourself. But the fact is that we will get problem reports that cannot be reproduced. That does not make them less relevant. I can't reproduce the OLTP benchmark myself. And I'm fully expecting to get problem reports for SLQB against insanely sized SGI systems, which I will take very seriously and try to fix them. > I'm also bit worried if SLQB has gotten > enough attention from the NUMA and HPC folks that brought us SLUB. It hasn't, but that's the problem we're hoping to solve by getting it merged. People can give it more attention, and we can try to fix any problems. SLUB has been default for quite a while now and not able to solve all problems it has had reported against it. So I hope SLQB will be able to unblock this situation. > The good news is that SLQB can replace SLAB so either way, we're not > going to end up with four allocators. Whether it can replace SLUB > remains to be seen. Well I think being able to simply replace SLAB is not ideal. The plan I'm hoping is to have four allocators for a few releases, and then go back to having two. That is going to mean some groups might not have their ideal allocator merged... but I think it is crazy to settle with more than one main compile-time allocator for the long term. I don't know what the next redhat enterprise release is going to do, but if they go with SLAB, then I think that means no SGI systems would run in production with SLUB anyway, so what would be the purpose of having a special "HPC/huge system" allocator? Or... what other reasons should users select SLUB vs SLAB? (in terms of core allocator behaviour, versus extras that can be ported from one to the other) If we can't even make up our own minds, then will others be able to? ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-19 7:13 ` Nick Piggin @ 2009-01-19 8:05 ` Pekka Enberg 2009-01-19 8:33 ` Nick Piggin 0 siblings, 1 reply; 93+ messages in thread From: Pekka Enberg @ 2009-01-19 8:05 UTC (permalink / raw) To: Nick Piggin Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Christoph Lameter Hi Nick, On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote: > SLUB was distinctly slower on the tbench, netperf, and hackbench > tests that I ran. These were faster with SLUB on your machine? I was trying to bisect a somewhat recent SLAB vs. SLUB regression in tbench that seems to be triggered by CONFIG_SLUB as suggested by Evgeniy Polyakov performance tests. Unfortunately I bisected it down to a bogus commit so while I saw SLUB beating SLAB, I also saw the reverse in nearby commits which didn't touch anything interesting. So for tbench, SLUB _used to_ dominate SLAB on my machine but the current situation is not as clear with all the tbench regressions in other subsystems. SLUB has been a consistent winner for hackbench after Christoph fixed the regression reported by Ingo Molnar two years (?) ago. I don't think I've ran netperf, but for the fio test you mentioned, SLUB is beating SLAB here. On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote: > What kind of system is it? 2-way Core2. I posted my /proc/cpuinfo in this thread if you're interested. On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote: > > So I have very mixed feelings about SLQB. It's very > > nice that it works for OLTP but we still don't have much insight (i.e. > > numbers) on why it's better. On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote: > According to estimates in this thread, I think Matthew said SLUB would > be around 6% slower? 
SLQB is within measurement error of SLAB. Yeah but I say that we don't know _why_ it's better. There's the kmalloc()/kfree() CPU ping-pong hypothesis but it could also be due to page allocator interaction or just a plain bug in SLUB. And lets not forget bad interaction with some random subsystem (SCSI, for example). On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote: > Fair point about personally reproducing the OLTP problem yourself. But > the fact is that we will get problem reports that cannot be reproduced. > That does not make them less relevant. I can't reproduce the OLTP > benchmark myself. And I'm fully expecting to get problem reports for > SLQB against insanely sized SGI systems, which I will take very seriously > and try to fix them. Again, it's not that I don't take the OLTP regression seriously (I do) but as a "part-time maintainer" I simply don't have the time and resources to attempt to fix it without either (a) being able to reproduce the problem or (b) have someone who can reproduce it who is willing to do oprofile and so on. So as much as I would have preferred that you had at least attempted to fix SLUB, I'm more than happy that we have a very active developer working on the problem now. I mean, I don't really care which allocator we decide to go forward with, if all the relevant regressions are dealt with. All I am saying is that I don't like how we're fixing a performance bug with a shiny new allocator without a credible explanation why the current approach is not fixable. On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote: > > The good news is that SLQB can replace SLAB so either way, we're not > > going to end up with four allocators. Whether it can replace SLUB > > remains to be seen. > > Well I think being able to simply replace SLAB is not ideal. The plan > I'm hoping is to have four allocators for a few releases, and then > go back to having two. That is going to mean some groups might not > have their ideal allocator merged... 
but I think it is crazy to settle > with more than one main compile-time allocator for the long term. So now the HPC folk will be screwed over by the OLTP folk? I guess that's okay as the latter have been treated rather badly for the past two years.... ;-) Pekka ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-19 8:05 ` Pekka Enberg @ 2009-01-19 8:33 ` Nick Piggin 2009-01-19 8:42 ` Nick Piggin 2009-01-19 9:48 ` Pekka Enberg 0 siblings, 2 replies; 93+ messages in thread From: Nick Piggin @ 2009-01-19 8:33 UTC (permalink / raw) To: Pekka Enberg Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Christoph Lameter On Monday 19 January 2009 19:05:03 Pekka Enberg wrote: > Hi Nick, > > On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote: > > SLUB was distinctly slower on the tbench, netperf, and hackbench > > tests that I ran. These were faster with SLUB on your machine? > > I was trying to bisect a somewhat recent SLAB vs. SLUB regression in > tbench that seems to be triggered by CONFIG_SLUB as suggested by Evgeniy > Polyakov performance tests. Unfortunately I bisected it down to a bogus > commit so while I saw SLUB beating SLAB, I also saw the reverse in > nearby commits which didn't touch anything interesting. So for tbench, > SLUB _used to_ dominate SLAB on my machine but the current situation is > not as clear with all the tbench regressions in other subsystems. OK. > SLUB has been a consistent winner for hackbench after Christoph fixed > the regression reported by Ingo Molnar two years (?) ago. I don't think > I've ran netperf, but for the fio test you mentioned, SLUB is beating > SLAB here. Hmm, netperf, hackbench, and fio are all faster with SLAB than SLUB. > On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote: > > What kind of system is it? > > 2-way Core2. I posted my /proc/cpuinfo in this thread if you're > interested. Thanks. I guess one of three obvious differences, mine is a K10, is NUMA, and has significantly more cores. 
I can try setting it to interleave cachelines over nodes or use fewer cores to see if the picture changes... > On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote: > > > So I have very mixed feelings about SLQB. It's very > > > nice that it works for OLTP but we still don't have much insight (i.e. > > > numbers) on why it's better. > > On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote: > > According to estimates in this thread, I think Matthew said SLUB would > > be around 6% slower? SLQB is within measurement error of SLAB. > > Yeah but I say that we don't know _why_ it's better. There's the > kmalloc()/kfree() CPU ping-pong hypothesis but it could also be due to > page allocator interaction or just a plain bug in SLUB. And lets not > forget bad interaction with some random subsystem (SCSI, for example). > > On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote: > > Fair point about personally reproducing the OLTP problem yourself. But > > the fact is that we will get problem reports that cannot be reproduced. > > That does not make them less relevant. I can't reproduce the OLTP > > benchmark myself. And I'm fully expecting to get problem reports for > > SLQB against insanely sized SGI systems, which I will take very seriously > > and try to fix them. > > Again, it's not that I don't take the OLTP regression seriously (I do) > but as a "part-time maintainer" I simply don't have the time and > resources to attempt to fix it without either (a) being able to > reproduce the problem or (b) have someone who can reproduce it who is > willing to do oprofile and so on. > > So as much as I would have preferred that you had at least attempted to > fix SLUB, I'm more than happy that we have a very active developer > working on the problem now. I mean, I don't really care which allocator > we decide to go forward with, if all the relevant regressions are dealt > with. OK, good to know. 
> All I am saying is that I don't like how we're fixing a performance bug > with a shiny new allocator without a credible explanation why the > current approach is not fixable. To be honest, my biggest concern with SLUB is the higher order pages thing. But Christoph always poo poos me when I raise that concern, and it's hard to get concrete numbers showing real fragmentation problems when it can take days or months to start biting. It really stems from queueing versus not queueing I guess. And I think SLUB is flawed due to its avoidance of queueing. > On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote: > > > The good news is that SLQB can replace SLAB so either way, we're not > > > going to end up with four allocators. Whether it can replace SLUB > > > remains to be seen. > > > > Well I think being able to simply replace SLAB is not ideal. The plan > > I'm hoping is to have four allocators for a few releases, and then > > go back to having two. That is going to mean some groups might not > > have their ideal allocator merged... but I think it is crazy to settle > > with more than one main compile-time allocator for the long term. > > So now the HPC folk will be screwed over by the OLTP folk? No. I'm imagining there will be a discussion of the 3, and at some point an executive decision will be made if an agreement can't be reached. At this point, I think that is a better and fairer option than just asserting one allocator is better than another and making it the default. And... we have no indication that SLQB will be worse for HPC than SLUB ;) > I guess > that's okay as the latter have been treated rather badly for the past > two years.... ;-) I don't know if that is meant to be sarcastic, but the OLTP performance numbers almost never get better from one kernel to the next. Actually the trend is downward. Mainly due to bloat or new features being added. 
I think that at some level, controlled addition of features that may add some cycles to these paths is not a bad idea (what good is Moore's Law if we can't have shiny new features? :) But on the other hand, this OLTP test is incredibly valuable to monitor the general performance- health of this area of the kernel. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-19 8:33 ` Nick Piggin @ 2009-01-19 8:42 ` Nick Piggin 2009-01-19 8:47 ` Pekka Enberg 2009-01-19 9:48 ` Pekka Enberg 1 sibling, 1 reply; 93+ messages in thread From: Nick Piggin @ 2009-01-19 8:42 UTC (permalink / raw) To: Pekka Enberg Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Christoph Lameter On Monday 19 January 2009 19:33:27 Nick Piggin wrote: > On Monday 19 January 2009 19:05:03 Pekka Enberg wrote: > > All I am saying is that I don't like how we're fixing a performance bug > > with a shiny new allocator without a credible explanation why the > > current approach is not fixable. > > To be honest, my biggest concern with SLUB is the higher order pages > thing. But Christoph always poo poos me when I raise that concern, and > it's hard to get concrete numbers showing real fragmentation problems > when it can take days or months to start biting. > > It really stems from queueing versus not queueing I guess. And I think > SLUB is flawed due to its avoidance of queueing. And FWIW, Christoph was also not able to fix the OLTP problem although I think it has been known for nearly two years ago now (I remember we talked about it at 2007 KS, although I wasn't following slab development very keenly back then). At this point I feel spending time working on SLUB isn't a good idea if a) Christoph himself hadn't fixed this problem; and b) we disagree about fundamental design choices (see the "SLQB slab allocator" thread). 
Anyway, nobody has disagreed with my proposal to merge SLQB, so in the worst case I don't think it will cause too much harm, and in the best case it might turn out to make the best tradeoffs and who knows, it might actually not be catastrophic for HPC ;) ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-19 8:42 ` Nick Piggin @ 2009-01-19 8:47 ` Pekka Enberg 2009-01-19 8:57 ` Nick Piggin 0 siblings, 1 reply; 93+ messages in thread From: Pekka Enberg @ 2009-01-19 8:47 UTC (permalink / raw) To: Nick Piggin Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Christoph Lameter On Mon, Jan 19, 2009 at 10:42 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Anyway, nobody has disagreed with my proposal to merge SLQB, so in the > worst case I don't think it will cause too much harm, and in the best > case it might turn out to make the best tradeoffs and who knows, it > might actually not be catastrophic for HPC ;) Yeah. If Andrew/Linus doesn't want to merge SLQB to 2.6.29, we can stick it in linux-next through slab.git if you want. Pekka ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-19 8:47 ` Pekka Enberg @ 2009-01-19 8:57 ` Nick Piggin 0 siblings, 0 replies; 93+ messages in thread From: Nick Piggin @ 2009-01-19 8:57 UTC (permalink / raw) To: Pekka Enberg Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Christoph Lameter On Monday 19 January 2009 19:47:24 Pekka Enberg wrote: > On Mon, Jan 19, 2009 at 10:42 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > Anyway, nobody has disagreed with my proposal to merge SLQB, so in the > > worst case I don't think it will cause too much harm, and in the best > > case it might turn out to make the best tradeoffs and who knows, it > > might actually not be catastrophic for HPC ;) > > Yeah. If Andrew/Linus doesn't want to merge SLQB to 2.6.29, we can I would prefer not. Apart from not practicing what I preach about merging, if it has stupid bugs on some systems or obvious performance problems, it will not be a good start ;) > stick it in linux-next through slab.git if you want. That would be appreciated. It's not quite ready yet... Thanks. Nick ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-19 8:33 ` Nick Piggin 2009-01-19 8:42 ` Nick Piggin @ 2009-01-19 9:48 ` Pekka Enberg 2009-01-19 10:03 ` Nick Piggin 1 sibling, 1 reply; 93+ messages in thread From: Pekka Enberg @ 2009-01-19 9:48 UTC (permalink / raw) To: Nick Piggin Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Christoph Lameter Hi Nick, On Mon, Jan 19, 2009 at 10:33 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: >> All I am saying is that I don't like how we're fixing a performance bug >> with a shiny new allocator without a credible explanation why the >> current approach is not fixable. > > To be honest, my biggest concern with SLUB is the higher order pages > thing. But Christoph always poo poos me when I raise that concern, and > it's hard to get concrete numbers showing real fragmentation problems > when it can take days or months to start biting. To be fair to SLUB, we do have the pending slab defragmentation patches in my tree. Not that we have any numbers on if defragmentation helps and how much. IIRC, Christoph said one of the reasons for avoiding queues in SLUB is to be able to do defragmentation. But I suppose with SLQB we can do the same thing as long as we flush the queues before attempting to defrag. Pekka ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-19 9:48 ` Pekka Enberg @ 2009-01-19 10:03 ` Nick Piggin 0 siblings, 0 replies; 93+ messages in thread From: Nick Piggin @ 2009-01-19 10:03 UTC (permalink / raw) To: Pekka Enberg Cc: Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Christoph Lameter On Monday 19 January 2009 20:48:52 Pekka Enberg wrote: > Hi Nick, > > On Mon, Jan 19, 2009 at 10:33 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> All I am saying is that I don't like how we're fixing a performance bug > >> with a shiny new allocator without a credible explanation why the > >> current approach is not fixable. > > > > To be honest, my biggest concern with SLUB is the higher order pages > > thing. But Christoph always poo poos me when I raise that concern, and > > it's hard to get concrete numbers showing real fragmentation problems > > when it can take days or months to start biting. > > To be fair to SLUB, we do have the pending slab defragmentation > patches in my tree. Not that we have any numbers on if defragmentation > helps and how much. IIRC, Christoph said one of the reasons for > avoiding queues in SLUB is to be able to do defragmentation. But I > suppose with SLQB we can do the same thing as long as we flush the > queues before attempting to defrag. I have had a look at them, (and I raised some concerns about races with the bufferhead "defragmentation" patch which I didn't get a reply to, but now's not the time to get into that). Christoph's design AFAIKS is not impossible with queued slab allocators, but they would just need to do either some kind of per-cpu processing, at least a way to flush queues of objects. This should not be impossible. 
But in my reply, I also outlined an idea for a possibly better design for targeted slab reclaim that could have fewer of the locking complexities in other subsystems than the slub defrag patches do. I plan to look at this at some point, but I think we need to sort out the basics first. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 10:31 ` Pekka Enberg 2009-01-16 10:42 ` Nick Piggin @ 2009-01-16 20:59 ` Christoph Lameter 1 sibling, 0 replies; 93+ messages in thread From: Christoph Lameter @ 2009-01-16 20:59 UTC (permalink / raw) To: Pekka Enberg Cc: Nick Piggin, Matthew Wilcox, Andrew Morton, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty On Fri, 16 Jan 2009, Pekka Enberg wrote: > aren't I? There's some scheduler related options such as > CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig > that I didn't have in the original tests. I can try without them if > you want but I'm not sure it's relevant for SLAB vs SLUB tests. I have seen CONFIG_GROUP_SCHED to affect latency tests in significant ways. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-15 7:24 ` Nick Piggin 2009-01-15 9:46 ` Pekka Enberg @ 2009-01-16 0:27 ` Andrew Morton 2009-01-16 4:03 ` Nick Piggin 1 sibling, 1 reply; 93+ messages in thread From: Andrew Morton @ 2009-01-16 0:27 UTC (permalink / raw) To: Nick Piggin Cc: matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Thu, 15 Jan 2009 18:24:36 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Given that SLAB and SLUB are fairly mature, I wonder what you'd think of > taking SLQB into -mm and making it the default there for a while, to see > if anybody reports a problem? Nobody would test it in interesting ways. We'd get more testing in linux-next, but still not enough, and not of the right type. It would be better to just make the decision, merge it and forge ahead. Me, I'd be 100% behind the idea if it had a credible prospect of a net reduction in the number of slab allocator implementations. I guess the naming convention will limit us to 26 of them. Fortunate indeed that the kernel isn't written in cyrillic! ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 0:27 ` Andrew Morton @ 2009-01-16 4:03 ` Nick Piggin 2009-01-16 4:12 ` Andrew Morton 0 siblings, 1 reply; 93+ messages in thread From: Nick Piggin @ 2009-01-16 4:03 UTC (permalink / raw) To: Andrew Morton Cc: matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Friday 16 January 2009 11:27:35 Andrew Morton wrote: > On Thu, 15 Jan 2009 18:24:36 +1100 > > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > Given that SLAB and SLUB are fairly mature, I wonder what you'd think of > > taking SLQB into -mm and making it the default there for a while, to see > > if anybody reports a problem? > > Nobody would test it in interesting ways. > > We'd get more testing in linux-next, but still not enough, and not of > the right type. It would be better than nothing, for SLQB, I guess. > It would be better to just make the decision, merge it and forge ahead. > > Me, I'd be 100% behind the idea if it had a credible prospect of a net > reduction in the number of slab allocator implementations. From the data we have so far, I think SLQB is a "credible prospect" to replace SLUB and SLAB. But then again, apparently SLUB was a credible prospect to replace SLAB when it was merged. Unfortunately I can't honestly say that some serious regression will not be discovered in SLQB that cannot be fixed. I guess that's never stopped us merging other rewrites before, though. I would like to see SLQB merged in mainline, made default, and wait for some number of releases. Then we take what we know, and try to make an informed decision about the best one to take. I guess that is problematic in that the rest of the kernel is moving underneath us. Do you have another idea? > I guess the naming convention will limit us to 26 of them. 
> Fortunate indeed that the kernel isn't written in cyrillic! I could have called it SL4B. 4 would be somehow fitting... ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 4:03 ` Nick Piggin @ 2009-01-16 4:12 ` Andrew Morton 2009-01-16 6:46 ` Nick Piggin 0 siblings, 1 reply; 93+ messages in thread From: Andrew Morton @ 2009-01-16 4:12 UTC (permalink / raw) To: Nick Piggin Cc: matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > I would like to see SLQB merged in mainline, made default, and wait for > some number releases. Then we take what we know, and try to make an > informed decision about the best one to take. I guess that is problematic > in that the rest of the kernel is moving underneath us. Do you have > another idea? Nope. If it doesn't work out, we can remove it again I guess. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 4:12 ` Andrew Morton @ 2009-01-16 6:46 ` Nick Piggin 2009-01-16 6:55 ` Matthew Wilcox ` (2 more replies) 0 siblings, 3 replies; 93+ messages in thread From: Nick Piggin @ 2009-01-16 6:46 UTC (permalink / raw) To: Andrew Morton, netdev, sfr Cc: matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Friday 16 January 2009 15:12:10 Andrew Morton wrote: > On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > I would like to see SLQB merged in mainline, made default, and wait for > > some number releases. Then we take what we know, and try to make an > > informed decision about the best one to take. I guess that is problematic > > in that the rest of the kernel is moving underneath us. Do you have > > another idea? > > Nope. If it doesn't work out, we can remove it again I guess. OK, I have these numbers to show I'm not completely off my rocker to suggest we merge SLQB :) Given these results, how about I ask to merge SLQB as default in linux-next, then if nothing catastrophic happens, merge it upstream in the next merge window, then a couple of releases after that, given some time to test and tweak SLQB, then we plan to bite the bullet and emerge with just one main slab allocator (plus SLOB). System is a 2socket, 4 core AMD. All debug and stats options turned off for all the allocators; default parameters (ie. SLUB using higher order pages, and the others tend to be using order-0). SLQB is the version I recently posted, with some of the prefetching removed according to Pekka's review (probably a good idea to only add things like that in if/when they prove to be an improvement). 
time fio examples/netio (10 runs, lower better):

SLAB  AVG=13.19 STD=0.40
SLQB  AVG=13.78 STD=0.24
SLUB  AVG=14.47 STD=0.23

SLAB makes a good showing here. The allocation/freeing pattern seems to be very regular and easy (fast allocs and frees). So it could be some "lucky" caching behaviour, I'm not exactly sure. I'll have to run more tests and profiles here.

hackbench (10 runs, lower better):

1 GROUP
SLAB  AVG=1.34 STD=0.05
SLQB  AVG=1.31 STD=0.06
SLUB  AVG=1.46 STD=0.07

2 GROUPS
SLAB  AVG=1.20 STD=0.09
SLQB  AVG=1.22 STD=0.12
SLUB  AVG=1.21 STD=0.06

4 GROUPS
SLAB  AVG=0.84 STD=0.05
SLQB  AVG=0.81 STD=0.10
SLUB  AVG=0.98 STD=0.07

8 GROUPS
SLAB  AVG=0.79 STD=0.10
SLQB  AVG=0.76 STD=0.15
SLUB  AVG=0.89 STD=0.08

16 GROUPS
SLAB  AVG=0.78 STD=0.08
SLQB  AVG=0.79 STD=0.10
SLUB  AVG=0.86 STD=0.05

32 GROUPS
SLAB  AVG=0.86 STD=0.05
SLQB  AVG=0.78 STD=0.06
SLUB  AVG=0.88 STD=0.06

64 GROUPS
SLAB  AVG=1.03 STD=0.05
SLQB  AVG=0.90 STD=0.04
SLUB  AVG=1.05 STD=0.06

128 GROUPS
SLAB  AVG=1.31 STD=0.19
SLQB  AVG=1.16 STD=0.36
SLUB  AVG=1.29 STD=0.11

SLQB tends to be the winner here. SLAB is close at lower numbers of groups, but drops behind a bit more as they increase.

tbench (10 runs, higher better):

1 THREAD
SLAB  AVG=239.25 STD=31.74
SLQB  AVG=257.75 STD=33.89
SLUB  AVG=223.02 STD=14.73

2 THREADS
SLAB  AVG=649.56 STD=9.77
SLQB  AVG=647.77 STD=7.48
SLUB  AVG=634.50 STD=7.66

4 THREADS
SLAB  AVG=1294.52 STD=13.19
SLQB  AVG=1266.58 STD=35.71
SLUB  AVG=1228.31 STD=48.08

8 THREADS
SLAB  AVG=2750.78 STD=26.67
SLQB  AVG=2758.90 STD=18.86
SLUB  AVG=2685.59 STD=22.41

16 THREADS
SLAB  AVG=2669.11 STD=58.34
SLQB  AVG=2671.69 STD=31.84
SLUB  AVG=2571.05 STD=45.39

SLAB and SLQB seem to be pretty close, winning some and losing some. They're always within a standard deviation of one another, so we can't make conclusions between them. SLUB seems to be a bit slower. 
Netperf UDP unidirectional send test (10 runs, higher better):

Server and client bound to same CPU
SLAB  AVG=60.111 STD=1.59382
SLQB  AVG=60.167 STD=0.685347
SLUB  AVG=58.277 STD=0.788328

Server and client bound to same socket, different CPUs
SLAB  AVG=85.938 STD=0.875794
SLQB  AVG=93.662 STD=2.07434
SLUB  AVG=81.983 STD=0.864362

Server and client bound to different sockets
SLAB  AVG=78.801 STD=1.44118
SLQB  AVG=78.269 STD=1.10457
SLUB  AVG=71.334 STD=1.16809

SLQB is up with SLAB for the first and last cases, and faster in the second case. SLUB trails in each case. (Any ideas for better types of netperf tests?)

Kbuild numbers don't seem to be significantly different. SLAB and SLQB actually got exactly the same average over 10 runs. The user+sys times tend to be almost identical between allocators, with elapsed time mainly depending on how much time the CPU was not idle.

Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within their measurement confidence interval. If it comes down to it, I think we could get them to do more runs to narrow that down, but we're talking a couple of tenths of a percent already.

I haven't done any non-local network tests. Networking is the one of the subsystems most heavily dependent on slab performance, so if anybody cares to run their favourite tests, that would be really helpful.

Disclaimer ---------- Now remember this is just one specific HW configuration, and some allocators for some reason give significantly (and sometimes perplexingly) different results between different CPU and system architectures. The other frustrating thing is that sometimes you happen to get a lucky or unlucky cache or NUMA layout depending on the compile, the boot, etc. So sometimes results get a little "skewed" in a way that isn't reflected in the STDDEV. But I've tried to minimise that. Dropping caches and restarting services etc. between individual runs. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 6:46 ` Nick Piggin @ 2009-01-16 6:55 ` Matthew Wilcox 2009-01-16 7:06 ` Nick Piggin 2009-01-16 7:53 ` Zhang, Yanmin 2009-01-16 7:00 ` Mainline kernel OLTP performance update Andrew Morton 2009-01-16 18:11 ` Rick Jones 2 siblings, 2 replies; 93+ messages in thread From: Matthew Wilcox @ 2009-01-16 6:55 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Zhang, Yanmin On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote: > Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within > their measurement confidence interval. If it comes down to it, I think we > could get them to do more runs to narrow that down, but we're talking a > couple of tenths of a percent already. I think I can speak with some measure of confidence for at least the OLTP-testing part of my company when I say that I have no objection to Nick's planned merge scheme. I believe the kernel benchmark group have also done some testing with SLQB and have generally positive things to say about it (Yanmin added to the gargantuan cc). Did slabtop get fixed to work with SLQB? -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 6:55 ` Matthew Wilcox @ 2009-01-16 7:06 ` Nick Piggin 2009-01-16 7:53 ` Zhang, Yanmin 1 sibling, 0 replies; 93+ messages in thread From: Nick Piggin @ 2009-01-16 7:06 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Zhang, Yanmin On Friday 16 January 2009 17:55:47 Matthew Wilcox wrote: > On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote: > > Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within > > their measurement confidence interval. If it comes down to it, I think we > > could get them to do more runs to narrow that down, but we're talking a > > couple of tenths of a percent already. > > I think I can speak with some measure of confidence for at least the > OLTP-testing part of my company when I say that I have no objection to > Nick's planned merge scheme. > > I believe the kernel benchmark group have also done some testing with > SLQB and have generally positive things to say about it (Yanmin added to > the gargantuan cc). > > Did slabtop get fixed to work with SLQB? Yes the old slabtop that works on /proc/slabinfo works with SLQB (ie. SLQB implements /proc/slabinfo). Lin Ming recently also ported the SLUB /sys/kernel/slab/ specific slabinfo tool to SLQB. Basically it reports in-depth internal event counts etc. and can operate on individual caches, making it very useful for performance "observability" and tuning. It is hard to come up with a single set of statistics that apply usefully to all the allocators. FWIW, it would be a useful tool to port over to SLAB too, if we end up deciding to go with SLAB. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 6:55 ` Matthew Wilcox 2009-01-16 7:06 ` Nick Piggin @ 2009-01-16 7:53 ` Zhang, Yanmin 2009-01-16 10:20 ` Andi Kleen 1 sibling, 1 reply; 93+ messages in thread From: Zhang, Yanmin @ 2009-01-16 7:53 UTC (permalink / raw) To: Matthew Wilcox Cc: Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Thu, 2009-01-15 at 23:55 -0700, Matthew Wilcox wrote: > On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote: > > Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within > > their measurement confidence interval. If it comes down to it, I think we > > could get them to do more runs to narrow that down, but we're talking a > > couple of tenths of a percent already. > > I think I can speak with some measure of confidence for at least the > OLTP-testing part of my company when I say that I have no objection to > Nick's planned merge scheme. > > I believe the kernel benchmark group have also done some testing with > SLQB and have generally positive things to say about it (Yanmin added to > the gargantuan cc). We did run lots of benchmarks with SLQB. Comparing with SLUB, one highlight of SLQB is with netperf UDP-U-4k. On my x86-64 machines, if I start 1 client and 1 server process and bind them to different physical cpus, the result of SLQB is about 20% better than SLUB's. If I start CPU_NUM clients and the same number of servers without binding, the result of SLQB is about 100% better than SLUB's. I think that's because SLQB doesn't pass big object allocations through to the page allocator. netperf UDP-U-1k shows less improvement with SLQB. The results of other benchmarks have variations. They are good on some machines, but bad on other machines. 
However, the variation is small. For example, hackbench's result with SLQB is about 1 second worse than with SLUB on the 8-core stoakley machine. After we worked with Nick on a small code change, SLQB's result is a little better than SLUB's with hackbench on stoakley. We consider the other variations as fluctuation. All the testing uses the default SLUB and SLQB configurations. > > Did slabtop get fixed to work with SLQB? > ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 7:53 ` Zhang, Yanmin @ 2009-01-16 10:20 ` Andi Kleen 2009-01-20 5:16 ` Zhang, Yanmin 0 siblings, 1 reply; 93+ messages in thread From: Andi Kleen @ 2009-01-16 10:20 UTC (permalink / raw) To: Zhang, Yanmin Cc: Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes: > I think that's because SLQB > doesn't pass through big object allocation to page allocator. > netperf UDP-U-1k has less improvement with SLQB. That sounds like just the page allocator needs to be improved. That would help everyone. We talked a bit about this earlier, some of the heuristics for hot/cold pages are quite outdated and have been tuned for obsolete machines and also its fast path is quite long. Unfortunately no code currently. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 10:20 ` Andi Kleen @ 2009-01-20 5:16 ` Zhang, Yanmin 2009-01-21 23:58 ` Christoph Lameter 0 siblings, 1 reply; 93+ messages in thread From: Zhang, Yanmin @ 2009-01-20 5:16 UTC (permalink / raw) To: Andi Kleen, Christoph Lameter, Pekka Enberg Cc: Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Fri, 2009-01-16 at 11:20 +0100, Andi Kleen wrote: > "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes: > > > > I think that's because SLQB > > doesn't pass through big object allocation to page allocator. > > netperf UDP-U-1k has less improvement with SLQB. > > That sounds like just the page allocator needs to be improved. > That would help everyone. We talked a bit about this earlier, > some of the heuristics for hot/cold pages are quite outdated > and have been tuned for obsolete machines and also its fast path > is quite long. Unfortunately no code currently. Andi, Thanks for your kind information. I did more investigation with SLUB on the netperf UDP-U-4k issue. oprofile shows:

328058  30.1342  linux-2.6.29-rc2  copy_user_generic_string
134666  12.3699  linux-2.6.29-rc2  __free_pages_ok
125447  11.5231  linux-2.6.29-rc2  get_page_from_freelist
 22611   2.0770  linux-2.6.29-rc2  __sk_mem_reclaim
 21442   1.9696  linux-2.6.29-rc2  list_del
 21187   1.9462  linux-2.6.29-rc2  __ip_route_output_key

So __free_pages_ok and get_page_from_freelist consume too much cpu time. With SLQB, these 2 functions almost don't consume time. Command 'slabinfo -AD' shows:

Name      Objects     Alloc      Free  %Fast
:0000256     1685  29611065  29609548  99 99
:0000168     2987    164689    161859  94 39
:0004096     1471    114918    113490  99 97

So kmem_cache :0000256 is very active. 
A kernel stack dump in __free_pages_ok shows:

[<ffffffff8027010f>] __free_pages_ok+0x109/0x2e0
[<ffffffff8024bb34>] autoremove_wake_function+0x0/0x2e
[<ffffffff8060f387>] __kfree_skb+0x9/0x6f
[<ffffffff8061204b>] skb_free_datagram+0xc/0x31
[<ffffffff8064b528>] udp_recvmsg+0x1e7/0x26f
[<ffffffff8060b509>] sock_common_recvmsg+0x30/0x45
[<ffffffff80609acd>] sock_recvmsg+0xd5/0xed

The callchain is: __kfree_skb => kfree_skbmem => kmem_cache_free(skbuff_head_cache, skb);

kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache with :0000256. Their order is 1, which means every slab consists of 2 physical pages.

netperf UDP-U-4k is a UDP stream test. The client process keeps sending 4k-size packets to the server process, and the server process just receives the packets one by one. If we start CPU_NUM clients and the same number of servers, every client sends lots of packets within one sched slice, then the process scheduler schedules the server to receive many packets within one sched slice; then the client resends again. So there are many packets in the queue. When the server receives the packets, it frees the skbuff_head_cache objects. When a slab's objects are all free, the slab is released by calling __free_pages. Such batch sending/receiving creates lots of slab free activity.

The page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a page buffer for page order 0. But here skbuff_head_cache's order is 1, so UDP-U-4k couldn't benefit from the page buffer.

SLQB has no such issue, because: 1) SLQB has a percpu freelist. Free objects are put on the list first and can be picked up later quickly without a lock. A batch parameter to control free object recollection is mostly 1024. 2) SLQB's slab order is mostly 0, so although it sometimes calls alloc_pages/free_pages, it can benefit from the zone_pcp(zone, cpu)->pcp page buffer.

So SLUB needs to resolve the issue that one process allocates a batch of objects and another process frees them in a batch. 
yanmin ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-20 5:16 ` Zhang, Yanmin @ 2009-01-21 23:58 ` Christoph Lameter 2009-01-22 8:36 ` Zhang, Yanmin 0 siblings, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2009-01-21 23:58 UTC (permalink / raw) To: Zhang, Yanmin Cc: Andi Kleen, Pekka Enberg, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. That order can be changed. Try specifying slub_max_order=0 on the kernel command line to force an order 0 alloc. The queues of the page allocator are of limited use due to their overhead. Order-1 allocations can actually be 5% faster than order-0. Order-0 makes sense if pages are pushed rapidly to the page allocator and are then reissued elsewhere. If there is linear consumption then the page allocator queues are just overhead. > Page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a page buffer for page order 0. > But here skbuff_head_cache's order is 1, so UDP-U-4k couldn't benefit from the page buffer. That usually does not matter because the partial list avoids page allocator actions. > SLQB has no such issue, because: > 1) SLQB has a percpu freelist. Free objects are put to the list firstly and can be picked up > later on quickly without lock. A batch parameter to control the free object recollection is mostly > 1024. > 2) SLQB slab order mostly is 0, so although sometimes it calls alloc_pages/free_pages, it can > benefit from zone_pcp(zone, cpu)->pcp page buffer. 
> > So SLUB need resolve such issues that one process allocates a batch of objects and another process > frees them batchly. SLUB has a percpu freelist but it's bounded by the basic allocation unit. You can increase that by modifying the allocation order. Writing a 3 or 5 into the order value in /sys/kernel/slab/xxx/order would do the trick. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-21 23:58 ` Christoph Lameter @ 2009-01-22 8:36 ` Zhang, Yanmin 2009-01-22 9:15 ` Pekka Enberg 0 siblings, 1 reply; 93+ messages in thread From: Zhang, Yanmin @ 2009-01-22 8:36 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, Pekka Enberg, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote: > On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. > > That order can be changed. Try specifying slub_max_order=0 on the kernel > command line to force an order 0 alloc. I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue. Both get_page_from_freelist and __free_pages_ok's cpu time are still very high. I checked my instrumentation in the kernel and found it's caused by large object allocation/free whose size is more than PAGE_SIZE. Here its order is 1. The real free callchain is __kfree_skb => skb_release_all => skb_release_data. So this case isn't the issue that a batch of allocations/frees might erase the partial page functionality. '#slabinfo -AD' couldn't show statistics of large object allocation/free. Can we add such info? That would be more helpful. In addition, I didn't find such an issue with TCP stream testing. > > The queues of the page allocator are of limited use due to their overhead. > Order-1 allocations can actually be 5% faster than order-0. order-0 makes > sense if pages are pushed rapidly to the page allocator and are then > reissues elsewhere. If there is a linear consumption then the page > allocator queues are just overhead. 
> > Page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a page buffer for page order 0.
> > But here skbuff_head_cache's order is 1, so UDP-U-4k couldn't benefit from the page buffer.
>
> That usually does not matter because the partial lists avoid page
> allocator actions.
>
> > SLQB has no such issue, because:
> > 1) SLQB has a percpu freelist. Free objects are put on the list first and can be picked up
> > again quickly without a lock. The batch parameter that controls free object recollection is mostly
> > 1024.
> > 2) SLQB's slab order is mostly 0, so although it sometimes calls alloc_pages/free_pages, it can
> > benefit from the zone_pcp(zone, cpu)->pcp page buffer.
> >
> > So SLUB needs to resolve the issue that one process allocates a batch of objects and another process
> > frees them in batches.
>
> SLUB has a percpu freelist, but it's bounded by the basic allocation unit.
> You can increase that by modifying the allocation order. Writing a 3 or 5
> into the order value in /sys/kernel/slab/xxx/order would do the trick.
^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-22 8:36 ` Zhang, Yanmin @ 2009-01-22 9:15 ` Pekka Enberg 2009-01-22 9:28 ` Zhang, Yanmin 0 siblings, 1 reply; 93+ messages in thread From: Pekka Enberg @ 2009-01-22 9:15 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote: > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote: > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel > > command line to force an order 0 alloc. > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue. > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high. > > I checked my instrumentation in kernel and found it's caused by large object allocation/free > whose size is more than PAGE_SIZE. Here its order is 1. > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data. > > So this case isn't the issue that batch of allocation/free might erase partial page > functionality. So is this the kfree(skb->head) in skb_release_data() or the put_page() calls in the same function in a loop? If it's the former, with big enough size passed to __alloc_skb(), the networking code might be taking a hit from the SLUB page allocator pass-through. Pekka ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-22 9:15 ` Pekka Enberg @ 2009-01-22 9:28 ` Zhang, Yanmin 2009-01-22 9:47 ` Pekka Enberg 0 siblings, 1 reply; 93+ messages in thread From: Zhang, Yanmin @ 2009-01-22 9:28 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote: > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote: > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote: > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > > > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. > > > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel > > > command line to force an order 0 alloc. > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue. > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high. > > > > I checked my instrumentation in kernel and found it's caused by large object allocation/free > > whose size is more than PAGE_SIZE. Here its order is 1. > > > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data. > > > > So this case isn't the issue that batch of allocation/free might erase partial page > > functionality. > > So is this the kfree(skb->head) in skb_release_data() or the put_page() > calls in the same function in a loop? It's kfree(skb->head). > > If it's the former, with big enough size passed to __alloc_skb(), the > networking code might be taking a hit from the SLUB page allocator > pass-through. 
^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-22 9:28 ` Zhang, Yanmin @ 2009-01-22 9:47 ` Pekka Enberg 2009-01-23 3:02 ` Zhang, Yanmin 0 siblings, 1 reply; 93+ messages in thread From: Pekka Enberg @ 2009-01-22 9:47 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Thu, 2009-01-22 at 17:28 +0800, Zhang, Yanmin wrote: > On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote: > > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote: > > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote: > > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > > > > > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. > > > > > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel > > > > command line to force an order 0 alloc. > > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue. > > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high. > > > > > > I checked my instrumentation in kernel and found it's caused by large object allocation/free > > > whose size is more than PAGE_SIZE. Here its order is 1. > > > > > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data. > > > > > > So this case isn't the issue that batch of allocation/free might erase partial page > > > functionality. > > > > So is this the kfree(skb->head) in skb_release_data() or the put_page() > > calls in the same function in a loop? > It's kfree(skb->head). 
> > > > > If it's the former, with big enough size passed to __alloc_skb(), the > > networking code might be taking a hit from the SLUB page allocator > > pass-through. Do we know what kind of size is being passed to __alloc_skb() in this case? Maybe we want to do something like this. Pekka SLUB: revert page allocator pass-through This is a revert of commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB: direct pass through of page size or higher kmalloc requests"). --- diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index 2f5c16b..3bd3662 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -124,7 +124,7 @@ struct kmem_cache { * We keep the general caches in an array of slab caches that are used for * 2^x bytes of allocations. */ -extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1]; +extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1]; /* * Sorry that the following has to be that ugly but some versions of GCC @@ -135,6 +135,9 @@ static __always_inline int kmalloc_index(size_t size) if (!size) return 0; + if (size > KMALLOC_MAX_SIZE) + return -1; + if (size <= KMALLOC_MIN_SIZE) return KMALLOC_SHIFT_LOW; @@ -154,10 +157,6 @@ static __always_inline int kmalloc_index(size_t size) if (size <= 1024) return 10; if (size <= 2 * 1024) return 11; if (size <= 4 * 1024) return 12; -/* - * The following is only needed to support architectures with a larger page - * size than 4k. 
- */ if (size <= 8 * 1024) return 13; if (size <= 16 * 1024) return 14; if (size <= 32 * 1024) return 15; @@ -167,6 +166,10 @@ static __always_inline int kmalloc_index(size_t size) if (size <= 512 * 1024) return 19; if (size <= 1024 * 1024) return 20; if (size <= 2 * 1024 * 1024) return 21; + if (size <= 4 * 1024 * 1024) return 22; + if (size <= 8 * 1024 * 1024) return 23; + if (size <= 16 * 1024 * 1024) return 24; + if (size <= 32 * 1024 * 1024) return 25; return -1; /* @@ -191,6 +194,19 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size) if (index == 0) return NULL; + /* + * This function only gets expanded if __builtin_constant_p(size), so + * testing it here shouldn't be needed. But some versions of gcc need + * help. + */ + if (__builtin_constant_p(size) && index < 0) { + /* + * Generate a link failure. Would be great if we could + * do something to stop the compile here. + */ + extern void __kmalloc_size_too_large(void); + __kmalloc_size_too_large(); + } return &kmalloc_caches[index]; } @@ -204,17 +220,9 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size) void *kmem_cache_alloc(struct kmem_cache *, gfp_t); void *__kmalloc(size_t size, gfp_t flags); -static __always_inline void *kmalloc_large(size_t size, gfp_t flags) -{ - return (void *)__get_free_pages(flags | __GFP_COMP, get_order(size)); -} - static __always_inline void *kmalloc(size_t size, gfp_t flags) { if (__builtin_constant_p(size)) { - if (size > PAGE_SIZE) - return kmalloc_large(size, flags); - if (!(flags & SLUB_DMA)) { struct kmem_cache *s = kmalloc_slab(size); diff --git a/mm/slub.c b/mm/slub.c index 6392ae5..8fad23f 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2475,7 +2475,7 @@ EXPORT_SYMBOL(kmem_cache_destroy); * Kmalloc subsystem *******************************************************************/ -struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned; +struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1] __cacheline_aligned; 
EXPORT_SYMBOL(kmalloc_caches); static int __init setup_slub_min_order(char *str) @@ -2537,7 +2537,7 @@ panic: } #ifdef CONFIG_ZONE_DMA -static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1]; +static struct kmem_cache *kmalloc_caches_dma[KMALLOC_SHIFT_HIGH + 1]; static void sysfs_add_func(struct work_struct *w) { @@ -2643,8 +2643,12 @@ static struct kmem_cache *get_slab(size_t size, gfp_t flags) return ZERO_SIZE_PTR; index = size_index[(size - 1) / 8]; - } else + } else { + if (size > KMALLOC_MAX_SIZE) + return NULL; + index = fls(size - 1); + } #ifdef CONFIG_ZONE_DMA if (unlikely((flags & SLUB_DMA))) @@ -2658,9 +2662,6 @@ void *__kmalloc(size_t size, gfp_t flags) { struct kmem_cache *s; - if (unlikely(size > PAGE_SIZE)) - return kmalloc_large(size, flags); - s = get_slab(size, flags); if (unlikely(ZERO_OR_NULL_PTR(s))) @@ -2670,25 +2671,11 @@ void *__kmalloc(size_t size, gfp_t flags) } EXPORT_SYMBOL(__kmalloc); -static void *kmalloc_large_node(size_t size, gfp_t flags, int node) -{ - struct page *page = alloc_pages_node(node, flags | __GFP_COMP, - get_order(size)); - - if (page) - return page_address(page); - else - return NULL; -} - #ifdef CONFIG_NUMA void *__kmalloc_node(size_t size, gfp_t flags, int node) { struct kmem_cache *s; - if (unlikely(size > PAGE_SIZE)) - return kmalloc_large_node(size, flags, node); - s = get_slab(size, flags); if (unlikely(ZERO_OR_NULL_PTR(s))) @@ -2746,11 +2733,8 @@ void kfree(const void *x) return; page = virt_to_head_page(x); - if (unlikely(!PageSlab(page))) { - BUG_ON(!PageCompound(page)); - put_page(page); + if (unlikely(WARN_ON(!PageSlab(page)))) /* XXX */ return; - } slab_free(page->slab, page, object, _RET_IP_); } EXPORT_SYMBOL(kfree); @@ -2985,7 +2969,7 @@ void __init kmem_cache_init(void) caches++; } - for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) { + for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) { create_kmalloc_cache(&kmalloc_caches[i], "kmalloc", 1 << i, GFP_KERNEL); caches++; @@ -3022,7 +3006,7 @@ 
void __init kmem_cache_init(void) slab_state = UP; /* Provide the correct kmalloc names now that the caches are up */ - for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) + for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) kmalloc_caches[i]. name = kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i); @@ -3222,9 +3206,6 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller) { struct kmem_cache *s; - if (unlikely(size > PAGE_SIZE)) - return kmalloc_large(size, gfpflags); - s = get_slab(size, gfpflags); if (unlikely(ZERO_OR_NULL_PTR(s))) @@ -3238,9 +3219,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags, { struct kmem_cache *s; - if (unlikely(size > PAGE_SIZE)) - return kmalloc_large_node(size, gfpflags, node); - s = get_slab(size, gfpflags); if (unlikely(ZERO_OR_NULL_PTR(s))) ^ permalink raw reply related [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-22 9:47 ` Pekka Enberg @ 2009-01-23 3:02 ` Zhang, Yanmin 2009-01-23 6:52 ` Pekka Enberg 2009-01-23 8:33 ` Nick Piggin 0 siblings, 2 replies; 93+ messages in thread From: Zhang, Yanmin @ 2009-01-23 3:02 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Thu, 2009-01-22 at 11:47 +0200, Pekka Enberg wrote: > On Thu, 2009-01-22 at 17:28 +0800, Zhang, Yanmin wrote: > > On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote: > > > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote: > > > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote: > > > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote: > > > > > > > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache > > > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages. > > > > > > > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel > > > > > command line to force an order 0 alloc. > > > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue. > > > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high. > > > > > > > > I checked my instrumentation in kernel and found it's caused by large object allocation/free > > > > whose size is more than PAGE_SIZE. Here its order is 1. > > > > > > > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data. > > > > > > > > So this case isn't the issue that batch of allocation/free might erase partial page > > > > functionality. 
> > > So is this the kfree(skb->head) in skb_release_data() or the put_page()
> > > calls in the same function in a loop?
> > It's kfree(skb->head).
>
> > > If it's the former, with big enough size passed to __alloc_skb(), the
> > > networking code might be taking a hit from the SLUB page allocator
> > > pass-through.
>
> Do we know what kind of size is being passed to __alloc_skb() in this
> case?
In function __alloc_skb, the original parameter size=4155, SKB_DATA_ALIGN(size)=4224, sizeof(struct skb_shared_info)=472, so __kmalloc_track_caller's parameter size=4696.

> Maybe we want to do something like this.
>
> Pekka
>
> SLUB: revert page allocator pass-through
This patch almost fixes the netperf UDP-U-4k issue.

#slabinfo -AD
Name             Objects     Alloc      Free  %Fast
:0000256            1658  70350463  70348946  99 99
kmalloc-8192          31  70322309  70322293  99 99
:0000168            2592    143154    140684  93 28
:0004096            1456     91072     89644  99 96
:0000192            3402     63838     60491  89 11
:0000064            6177     49635     43743  98 77

So kmalloc-8192 appears. Without the patch, kmalloc-8192 is hidden (its allocations pass through to the page allocator). kmalloc-8192's default order on my 8-core Stoakley is 2.

1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
2) If I start 1 client and 1 server, and bind them to different physical cpus, SLQB's result
   is about 10% better than SLUB's.

I don't know why there is still a 10% difference with item 2). Maybe cache misses cause it?

> This is a revert of commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB:
> direct pass through of page size or higher kmalloc requests").
> ---
>
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index 2f5c16b..3bd3662 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 3:02 ` Zhang, Yanmin @ 2009-01-23 6:52 ` Pekka Enberg 2009-01-23 8:06 ` Pekka Enberg 2009-01-23 8:33 ` Nick Piggin 1 sibling, 1 reply; 93+ messages in thread
From: Pekka Enberg @ 2009-01-23 6:52 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo

Zhang, Yanmin wrote:
>>>> If it's the former, with big enough size passed to __alloc_skb(), the
>>>> networking code might be taking a hit from the SLUB page allocator
>>>> pass-through.
>> Do we know what kind of size is being passed to __alloc_skb() in this
>> case?
> In function __alloc_skb, original parameter size=4155,
> SKB_DATA_ALIGN(size)=4224, sizeof(struct skb_shared_info)=472, so
> __kmalloc_track_caller's parameter size=4696.

OK, so all allocations go straight to the page allocator.

>> Maybe we want to do something like this.
>>
>> SLUB: revert page allocator pass-through
> This patch almost fixes the netperf UDP-U-4k issue.
>
> #slabinfo -AD
> Name             Objects     Alloc      Free  %Fast
> :0000256            1658  70350463  70348946  99 99
> kmalloc-8192          31  70322309  70322293  99 99
> :0000168            2592    143154    140684  93 28
> :0004096            1456     91072     89644  99 96
> :0000192            3402     63838     60491  89 11
> :0000064            6177     49635     43743  98 77
>
> So kmalloc-8192 appears. Without the patch, kmalloc-8192 is hidden.
> kmalloc-8192's default order on my 8-core Stoakley is 2.

Christoph, should we merge my patch as-is or do you have an alternative fix in mind? We could, of course, increase kmalloc() caches one level up to 8192 or higher.
> 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
> 2) If I start 1 client and 1 server, and bind them to different physical cpus, SLQB's result
> is about 10% better than SLUB's.
>
> I don't know why there is still a 10% difference with item 2). Maybe cache misses cause it?

Maybe we can use the perfstat and/or kerneltop utilities of the new perf counters patch to diagnose this:

http://lkml.org/lkml/2009/1/21/273

And do oprofile, of course. Thanks!

Pekka
^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 6:52 ` Pekka Enberg @ 2009-01-23 8:06 ` Pekka Enberg 2009-01-23 8:30 ` Zhang, Yanmin 0 siblings, 1 reply; 93+ messages in thread From: Pekka Enberg @ 2009-01-23 8:06 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote: > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's; > > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result > > is about 10% better than SLUB's. > > > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it? > > Maybe we can use the perfstat and/or kerneltop utilities of the new perf > counters patch to diagnose this: > > http://lkml.org/lkml/2009/1/21/273 > > And do oprofile, of course. Thanks! I assume binding the client and the server to different physical CPUs also means that the SKB is always allocated on CPU 1 and freed on CPU 2? If so, we will be taking the __slab_free() slow path all the time on kfree() which will cause cache effects, no doubt. But there's another potential performance hit we're taking because the object size of the cache is so big. As allocations from CPU 1 keep coming in, we need to allocate new pages and unfreeze the per-cpu page. That in turn causes __slab_free() to be more eager to discard the slab (see the PageSlubFrozen check there). So before going for cache profiling, I'd really like to see an oprofile report. I suspect we're still going to see much more page allocator activity there than with SLAB or SLQB which is why we're still behaving so badly here. 
Pekka ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 8:06 ` Pekka Enberg @ 2009-01-23 8:30 ` Zhang, Yanmin 2009-01-23 8:40 ` Pekka Enberg 2009-01-23 9:46 ` Pekka Enberg 0 siblings, 2 replies; 93+ messages in thread From: Zhang, Yanmin @ 2009-01-23 8:30 UTC (permalink / raw) To: Pekka Enberg Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo On Fri, 2009-01-23 at 10:06 +0200, Pekka Enberg wrote: > On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote: > > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's; > > > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result > > > is about 10% better than SLUB's. > > > > > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it? > > > > Maybe we can use the perfstat and/or kerneltop utilities of the new perf > > counters patch to diagnose this: > > > > http://lkml.org/lkml/2009/1/21/273 > > > > And do oprofile, of course. Thanks! > > I assume binding the client and the server to different physical CPUs > also means that the SKB is always allocated on CPU 1 and freed on CPU > 2? If so, we will be taking the __slab_free() slow path all the time on > kfree() which will cause cache effects, no doubt. > > But there's another potential performance hit we're taking because the > object size of the cache is so big. As allocations from CPU 1 keep > coming in, we need to allocate new pages and unfreeze the per-cpu page. > That in turn causes __slab_free() to be more eager to discard the slab > (see the PageSlubFrozen check there). > > So before going for cache profiling, I'd really like to see an oprofile > report. 
> I suspect we're still going to see much more page allocator
> activity
Theoretically, it should, but oprofile doesn't show that.

> there than with SLAB or SLQB which is why we're still behaving
> so badly here.
oprofile output with 2.6.29-rc2-slubrevertlarge:
CPU: Core 2, speed 2666.71 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        app name  symbol name
132779   32.9951  vmlinux   copy_user_generic_string
25334     6.2954  vmlinux   schedule
21032     5.2264  vmlinux   tg_shares_up
17175     4.2679  vmlinux   __skb_recv_datagram
9091      2.2591  vmlinux   sock_def_readable
8934      2.2201  vmlinux   mwait_idle
8796      2.1858  vmlinux   try_to_wake_up
6940      1.7246  vmlinux   __slab_free

#slabinfo -AD
Name             Objects    Alloc     Free  %Fast
:0000256            1643  5215544  5214027  94  0
kmalloc-8192          28  5189576  5189560   0  0
:0000168            2631   141466   138976  92 28
:0004096            1452    88697    87269  99 96
:0000192            3402    63050    59732  89 11
:0000064            6265    46611    40721  98 82
:0000128            1895    30429    28654  93 32

oprofile output with kernel 2.6.29-rc2-slqb0121:
CPU: Core 2, speed 2666.76 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name  app name  symbol name
114793   28.7163  vmlinux     vmlinux   copy_user_generic_string
27880     6.9744  vmlinux     vmlinux   tg_shares_up
22218     5.5580  vmlinux     vmlinux   schedule
12238     3.0614  vmlinux     vmlinux   mwait_idle
7395      1.8499  vmlinux     vmlinux   task_rq_lock
7348      1.8382  vmlinux     vmlinux   sock_def_readable
7202      1.8016  vmlinux     vmlinux   sched_clock_cpu
6981      1.7464  vmlinux     vmlinux   __skb_recv_datagram
6566      1.6425  vmlinux     vmlinux   udp_queue_rcv_skb
^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 8:30 ` Zhang, Yanmin @ 2009-01-23 8:40 ` Pekka Enberg 2009-01-23 9:46 ` Pekka Enberg 1 sibling, 0 replies; 93+ messages in thread From: Pekka Enberg @ 2009-01-23 8:40 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo On Fri, 2009-01-23 at 16:30 +0800, Zhang, Yanmin wrote: > > I assume binding the client and the server to different physical CPUs > > also means that the SKB is always allocated on CPU 1 and freed on CPU > > 2? If so, we will be taking the __slab_free() slow path all the time on > > kfree() which will cause cache effects, no doubt. > > > > But there's another potential performance hit we're taking because the > > object size of the cache is so big. As allocations from CPU 1 keep > > coming in, we need to allocate new pages and unfreeze the per-cpu page. > > That in turn causes __slab_free() to be more eager to discard the slab > > (see the PageSlubFrozen check there). > > > > So before going for cache profiling, I'd really like to see an oprofile > > report. I suspect we're still going to see much more page allocator > > activity > Theoretically, it should, but oprofile doesn't show that. > > > there than with SLAB or SLQB which is why we're still behaving > > so badly here. 
> oprofile output with 2.6.29-rc2-slubrevertlarge:
> CPU: Core 2, speed 2666.71 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples  %        app name  symbol name
> 132779   32.9951  vmlinux   copy_user_generic_string
> 25334     6.2954  vmlinux   schedule
> 21032     5.2264  vmlinux   tg_shares_up
> 17175     4.2679  vmlinux   __skb_recv_datagram
> 9091      2.2591  vmlinux   sock_def_readable
> 8934      2.2201  vmlinux   mwait_idle
> 8796      2.1858  vmlinux   try_to_wake_up
> 6940      1.7246  vmlinux   __slab_free
>
> #slabinfo -AD
> Name             Objects    Alloc     Free  %Fast
> :0000256            1643  5215544  5214027  94  0
> kmalloc-8192          28  5189576  5189560   0  0
                                               ^^^^^
This looks a bit funny. Hmm.

> :0000168            2631   141466   138976  92 28
> :0004096            1452    88697    87269  99 96
> :0000192            3402    63050    59732  89 11
> :0000064            6265    46611    40721  98 82
> :0000128            1895    30429    28654  93 32
^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 8:30 ` Zhang, Yanmin 2009-01-23 8:40 ` Pekka Enberg @ 2009-01-23 9:46 ` Pekka Enberg 2009-01-23 15:22 ` Christoph Lameter 1 sibling, 1 reply; 93+ messages in thread From: Pekka Enberg @ 2009-01-23 9:46 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, mingo On Fri, 2009-01-23 at 16:30 +0800, Zhang, Yanmin wrote: > On Fri, 2009-01-23 at 10:06 +0200, Pekka Enberg wrote: > > On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote: > > > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's; > > > > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result > > > > is about 10% better than SLUB's. > > > > > > > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it? > > > > > > Maybe we can use the perfstat and/or kerneltop utilities of the new perf > > > counters patch to diagnose this: > > > > > > http://lkml.org/lkml/2009/1/21/273 > > > > > > And do oprofile, of course. Thanks! > > > > I assume binding the client and the server to different physical CPUs > > also means that the SKB is always allocated on CPU 1 and freed on CPU > > 2? If so, we will be taking the __slab_free() slow path all the time on > > kfree() which will cause cache effects, no doubt. > > > > But there's another potential performance hit we're taking because the > > object size of the cache is so big. As allocations from CPU 1 keep > > coming in, we need to allocate new pages and unfreeze the per-cpu page. > > That in turn causes __slab_free() to be more eager to discard the slab > > (see the PageSlubFrozen check there). 
> > > > So before going for cache profiling, I'd really like to see an oprofile > > report. I suspect we're still going to see much more page allocator > > activity > Theoretically, it should, but oprofile doesn't show that. That's bit surprising, actually. FWIW, I've included a patch for empty slab lists. But it's probably not going to help here. > > there than with SLAB or SLQB which is why we're still behaving > > so badly here. > > oprofile output with 2.6.29-rc2-slubrevertlarge: > CPU: Core 2, speed 2666.71 MHz (estimated) > Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 > samples % app name symbol name > 132779 32.9951 vmlinux copy_user_generic_string > 25334 6.2954 vmlinux schedule > 21032 5.2264 vmlinux tg_shares_up > 17175 4.2679 vmlinux __skb_recv_datagram > 9091 2.2591 vmlinux sock_def_readable > 8934 2.2201 vmlinux mwait_idle > 8796 2.1858 vmlinux try_to_wake_up > 6940 1.7246 vmlinux __slab_free > > #slaninfo -AD > Name Objects Alloc Free %Fast > :0000256 1643 5215544 5214027 94 0 > kmalloc-8192 28 5189576 5189560 0 0 > :0000168 2631 141466 138976 92 28 > :0004096 1452 88697 87269 99 96 > :0000192 3402 63050 59732 89 11 > :0000064 6265 46611 40721 98 82 > :0000128 1895 30429 28654 93 32 Looking at __slab_free(), unless page->inuse is constantly zero and we discard the slab, it really is just cache effects (10% sounds like a lot, though!). AFAICT, the only way to optimize that is with Christoph's unfinished pointer freelists patches or with a remote free list like in SLQB. 
			Pekka

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 3bd3662..41a4c1a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -48,6 +48,9 @@ struct kmem_cache_node {
 	unsigned long nr_partial;
 	unsigned long min_partial;
 	struct list_head partial;
+	unsigned long nr_empty;
+	unsigned long max_empty;
+	struct list_head empty;
 #ifdef CONFIG_SLUB_DEBUG
 	atomic_long_t nr_slabs;
 	atomic_long_t total_objects;
diff --git a/mm/slub.c b/mm/slub.c
index 8fad23f..5a12597 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -134,6 +134,11 @@
  */
 #define MAX_PARTIAL 10

+/*
+ * Maximum number of empty slabs.
+ */
+#define MAX_EMPTY 1
+
 #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
 				SLAB_POISON | SLAB_STORE_USER)
@@ -1205,6 +1210,24 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
 	free_slab(s, page);
 }

+static void discard_or_cache_slab(struct kmem_cache *s, struct page *page)
+{
+	struct kmem_cache_node *n;
+	int node;
+
+	node = page_to_nid(page);
+	n = get_node(s, node);
+
+	dec_slabs_node(s, node, page->objects);
+
+	if (likely(n->nr_empty >= n->max_empty)) {
+		free_slab(s, page);
+	} else {
+		n->nr_empty++;
+		list_add(&page->lru, &n->empty);
+	}
+}
+
 /*
  * Per slab locking using the pagelock
  */
@@ -1252,7 +1275,7 @@ static void remove_partial(struct kmem_cache *s, struct page *page)
 }

 /*
- * Lock slab and remove from the partial list.
+ * Lock slab and remove from the partial or empty list.
  *
  * Must hold list_lock.
  */
@@ -1261,7 +1284,6 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
 {
 	if (slab_trylock(page)) {
 		list_del(&page->lru);
-		n->nr_partial--;
 		__SetPageSlubFrozen(page);
 		return 1;
 	}
@@ -1271,7 +1293,7 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
 /*
  * Try to allocate a partial slab from a specific node.
  */
-static struct page *get_partial_node(struct kmem_cache_node *n)
+static struct page *get_partial_or_empty_node(struct kmem_cache_node *n)
 {
 	struct page *page;
@@ -1281,13 +1303,22 @@ static struct page *get_partial_node(struct kmem_cache_node *n)
 	 * partial slab and there is none available then get_partials()
 	 * will return NULL.
 	 */
-	if (!n || !n->nr_partial)
+	if (!n || (!n->nr_partial && !n->nr_empty))
 		return NULL;

 	spin_lock(&n->list_lock);
+
 	list_for_each_entry(page, &n->partial, lru)
-		if (lock_and_freeze_slab(n, page))
+		if (lock_and_freeze_slab(n, page)) {
+			n->nr_partial--;
+			goto out;
+		}
+
+	list_for_each_entry(page, &n->empty, lru)
+		if (lock_and_freeze_slab(n, page)) {
+			n->nr_empty--;
 			goto out;
+		}
 	page = NULL;
 out:
 	spin_unlock(&n->list_lock);
@@ -1297,7 +1328,7 @@ out:
 /*
  * Get a page from somewhere. Search in increasing NUMA distances.
  */
-static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
+static struct page *get_any_partial_or_empty(struct kmem_cache *s, gfp_t flags)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
@@ -1336,7 +1367,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 				n->nr_partial > n->min_partial) {
-			page = get_partial_node(n);
+			page = get_partial_or_empty_node(n);
 			if (page)
 				return page;
 		}
@@ -1346,18 +1377,19 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 }

 /*
- * Get a partial page, lock it and return it.
+ * Get a partial or empty page, lock it and return it.
  */
-static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *
+get_partial_or_empty(struct kmem_cache *s, gfp_t flags, int node)
 {
 	struct page *page;
 	int searchnode = (node == -1) ? numa_node_id() : node;

-	page = get_partial_node(get_node(s, searchnode));
+	page = get_partial_or_empty_node(get_node(s, searchnode));
 	if (page || (flags & __GFP_THISNODE))
 		return page;

-	return get_any_partial(s, flags);
+	return get_any_partial_or_empty(s, flags);
 }
@@ -1403,7 +1435,7 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 		} else {
 			slab_unlock(page);
 			stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB);
-			discard_slab(s, page);
+			discard_or_cache_slab(s, page);
 		}
 	}
 }
@@ -1542,7 +1574,7 @@ another_slab:
 	deactivate_slab(s, c);

 new_slab:
-	new = get_partial(s, gfpflags, node);
+	new = get_partial_or_empty(s, gfpflags, node);
 	if (new) {
 		c->page = new;
 		stat(c, ALLOC_FROM_PARTIAL);
@@ -1693,7 +1725,7 @@ slab_empty:
 	}
 	slab_unlock(page);
 	stat(c, FREE_SLAB);
-	discard_slab(s, page);
+	discard_or_cache_slab(s, page);
 	return;

 debug:
@@ -1927,6 +1959,8 @@ static void init_kmem_cache_cpu(struct kmem_cache *s,
 static void
 init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
 {
+	spin_lock_init(&n->list_lock);
+
 	n->nr_partial = 0;

 	/*
@@ -1939,8 +1973,18 @@ init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
 	else if (n->min_partial > MAX_PARTIAL)
 		n->min_partial = MAX_PARTIAL;

-	spin_lock_init(&n->list_lock);
 	INIT_LIST_HEAD(&n->partial);
+
+	n->nr_empty = 0;
+	/*
+	 * XXX: This needs to take object size into account. We don't need
+	 * empty slabs for caches which will have plenty of partial slabs
+	 * available. Only caches that have either full or empty slabs need
+	 * this kind of optimization.
+	 */
+	n->max_empty = MAX_EMPTY;
+	INIT_LIST_HEAD(&n->empty);
+
 #ifdef CONFIG_SLUB_DEBUG
 	atomic_long_set(&n->nr_slabs, 0);
 	atomic_long_set(&n->total_objects, 0);
@@ -2427,6 +2471,32 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
 	spin_unlock_irqrestore(&n->list_lock, flags);
 }

+static void free_empty_slabs(struct kmem_cache *s)
+{
+	int node;
+
+	for_each_node_state(node, N_NORMAL_MEMORY) {
+		struct kmem_cache_node *n;
+		struct page *page, *t;
+		unsigned long flags;
+
+		n = get_node(s, node);
+
+		if (!n->nr_empty)
+			continue;
+
+		spin_lock_irqsave(&n->list_lock, flags);
+
+		list_for_each_entry_safe(page, t, &n->empty, lru) {
+			list_del(&page->lru);
+			n->nr_empty--;
+
+			free_slab(s, page);
+		}
+		spin_unlock_irqrestore(&n->list_lock, flags);
+	}
+}
+
 /*
  * Release all resources used by a slab cache.
  */
@@ -2436,6 +2506,8 @@ static inline int kmem_cache_close(struct kmem_cache *s)

 	flush_all(s);

+	free_empty_slabs(s);
+
 	/* Attempt to free all objects */
 	free_kmem_cache_cpus(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
@@ -2765,6 +2837,7 @@ int kmem_cache_shrink(struct kmem_cache *s)
 		return -ENOMEM;

 	flush_all(s);
+	free_empty_slabs(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		n = get_node(s, node);

^ permalink raw reply related	[flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 9:46 ` Pekka Enberg @ 2009-01-23 15:22 ` Christoph Lameter 2009-01-23 15:31 ` Pekka Enberg 2009-01-24 2:55 ` Zhang, Yanmin 0 siblings, 2 replies; 93+ messages in thread From: Christoph Lameter @ 2009-01-23 15:22 UTC (permalink / raw) To: Pekka Enberg Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Fri, 23 Jan 2009, Pekka Enberg wrote: > Looking at __slab_free(), unless page->inuse is constantly zero and we > discard the slab, it really is just cache effects (10% sounds like a > lot, though!). AFAICT, the only way to optimize that is with Christoph's > unfinished pointer freelists patches or with a remote free list like in > SLQB. No there is another way. Increase the allocator order to 3 for the kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the larger chunks of data gotten from the page allocator. That will allow slub to do fast allocs. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-01-23 15:22 ` Christoph Lameter
@ 2009-01-23 15:31 ` Pekka Enberg
  2009-01-23 15:55 ` Christoph Lameter
  2009-01-24  2:55 ` Zhang, Yanmin
  1 sibling, 1 reply; 93+ messages in thread
From: Pekka Enberg @ 2009-01-23 15:31 UTC (permalink / raw)
To: Christoph Lameter
Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin,
    Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
    chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
    suresh.b.siddha, harita.chilukuri, douglas.w.styner,
    peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
    linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
> On Fri, 23 Jan 2009, Pekka Enberg wrote:
>
> > Looking at __slab_free(), unless page->inuse is constantly zero and we
> > discard the slab, it really is just cache effects (10% sounds like a
> > lot, though!). AFAICT, the only way to optimize that is with Christoph's
> > unfinished pointer freelists patches or with a remote free list like in
> > SLQB.
>
> No there is another way. Increase the allocator order to 3 for the
> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
> larger chunks of data gotten from the page allocator. That will allow slub
> to do fast allocs.

I wonder why that doesn't happen already, actually. The slub_max_order
knob is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default and obviously
order 3 should be as good a fit as order 2 so 'fraction' can't be too
high either. Hmm.

			Pekka

^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-01-23 15:31 ` Pekka Enberg
@ 2009-01-23 15:55 ` Christoph Lameter
  2009-01-23 16:01 ` Pekka Enberg
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2009-01-23 15:55 UTC (permalink / raw)
To: Pekka Enberg
Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin,
    Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
    chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
    suresh.b.siddha, harita.chilukuri, douglas.w.styner,
    peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
    linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Fri, 23 Jan 2009, Pekka Enberg wrote:

> I wonder why that doesn't happen already, actually. The slub_max_order
> knob is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default and obviously
> order 3 should be as good a fit as order 2 so 'fraction' can't be too high
> either. Hmm.

The kmalloc-8192 is new. Look at slabinfo output to see what allocation
orders are chosen.

^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-01-23 15:55 ` Christoph Lameter
@ 2009-01-23 16:01 ` Pekka Enberg
  0 siblings, 0 replies; 93+ messages in thread
From: Pekka Enberg @ 2009-01-23 16:01 UTC (permalink / raw)
To: Christoph Lameter
Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin,
    Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
    chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
    suresh.b.siddha, harita.chilukuri, douglas.w.styner,
    peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
    linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Fri, 23 Jan 2009, Pekka Enberg wrote:
> > I wonder why that doesn't happen already, actually. The slub_max_order
> > knob is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default and obviously
> > order 3 should be as good a fit as order 2 so 'fraction' can't be too high
> > either. Hmm.

On Fri, 2009-01-23 at 10:55 -0500, Christoph Lameter wrote:
> The kmalloc-8192 is new. Look at slabinfo output to see what allocation
> orders are chosen.

Yes, yes, I know the new cache is a result of my patch. I'm just saying
that AFAICT, the existing logic should set the order to 3 but IIRC
Yanmin said it's 2.

			Pekka

^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-01-23 15:22 ` Christoph Lameter
  2009-01-23 15:31 ` Pekka Enberg
@ 2009-01-24  2:55 ` Zhang, Yanmin
  2009-01-24  7:36 ` Pekka Enberg
  2009-01-26 17:36 ` Christoph Lameter
  1 sibling, 2 replies; 93+ messages in thread
From: Zhang, Yanmin @ 2009-01-24 2:55 UTC (permalink / raw)
To: Christoph Lameter
Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin,
    Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
    chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
    suresh.b.siddha, harita.chilukuri, douglas.w.styner,
    peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
    linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
> On Fri, 23 Jan 2009, Pekka Enberg wrote:
>
> > Looking at __slab_free(), unless page->inuse is constantly zero and we
> > discard the slab, it really is just cache effects (10% sounds like a
> > lot, though!). AFAICT, the only way to optimize that is with Christoph's
> > unfinished pointer freelists patches or with a remote free list like in
> > SLQB.
>
> No there is another way. Increase the allocator order to 3 for the
> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
> larger chunks of data gotten from the page allocator. That will allow slub
> to do fast allocs.

After I change kmalloc-8192/order to 3, the result (pinned netperf
UDP-U-4k) difference between SLUB and SLQB becomes 1%, which can be
considered as fluctuation. But when trying to increase it to 4, I got:

[root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order
[root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order
-bash: echo: write error: Invalid argument

Comparing with SLQB, it seems SLUB needs too much investigation/manual
fine-tuning against specific benchmarks. One hard part is to tune the
page order number. Although SLQB also has many tuning options, I almost
don't tune it manually, just run the benchmark and collect results to
compare. Does that mean the scalability of SLQB is better?

^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-01-24  2:55 ` Zhang, Yanmin
@ 2009-01-24  7:36 ` Pekka Enberg
  2009-02-12  5:22 ` Zhang, Yanmin
  2009-01-26 17:36 ` Christoph Lameter
  1 sibling, 1 reply; 93+ messages in thread
From: Pekka Enberg @ 2009-01-24 7:36 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
    Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
    chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
    suresh.b.siddha, harita.chilukuri, douglas.w.styner,
    peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
    linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
>> No there is another way. Increase the allocator order to 3 for the
>> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
>> larger chunks of data gotten from the page allocator. That will allow slub
>> to do fast allocs.

On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
<yanmin_zhang@linux.intel.com> wrote:
> After I change kmalloc-8192/order to 3, the result (pinned netperf UDP-U-4k)
> difference between SLUB and SLQB becomes 1%, which can be considered as fluctuation.

Great. We should fix calculate_order() to be order 3 for kmalloc-8192.
Are you interested in doing that?

On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
<yanmin_zhang@linux.intel.com> wrote:
> But when trying to increase it to 4, I got:
> [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order
> [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order
> -bash: echo: write error: Invalid argument

That's probably because max order is capped to 3. You can change that by
passing slub_max_order=<n> as a kernel parameter.

On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
<yanmin_zhang@linux.intel.com> wrote:
> Comparing with SLQB, it seems SLUB needs too much investigation/manual
> fine-tuning against specific benchmarks. One hard part is to tune the
> page order number. Although SLQB also has many tuning options, I almost
> don't tune it manually, just run the benchmark and collect results to
> compare. Does that mean the scalability of SLQB is better?

One thing is sure, SLUB seems to be hard to tune. Probably because it's
so dependent on the page order.

^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-01-24  7:36 ` Pekka Enberg
@ 2009-02-12  5:22 ` Zhang, Yanmin
  2009-02-12  5:47 ` Zhang, Yanmin
  0 siblings, 1 reply; 93+ messages in thread
From: Zhang, Yanmin @ 2009-02-12 5:22 UTC (permalink / raw)
To: Pekka Enberg
Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
    Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
    chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
    suresh.b.siddha, harita.chilukuri, douglas.w.styner,
    peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
    linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Sat, 2009-01-24 at 09:36 +0200, Pekka Enberg wrote:
> On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
> >> No there is another way. Increase the allocator order to 3 for the
> >> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
> >> larger chunks of data gotten from the page allocator. That will allow slub
> >> to do fast allocs.
>
> On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
> <yanmin_zhang@linux.intel.com> wrote:
> > After I change kmalloc-8192/order to 3, the result (pinned netperf UDP-U-4k)
> > difference between SLUB and SLQB becomes 1%, which can be considered as fluctuation.
>
> Great. We should fix calculate_order() to be order 3 for kmalloc-8192.
> Are you interested in doing that?

Pekka,

Sorry for the late update.
The default order of kmalloc-8192 on the 2*4 stoakley machine is really
an issue of calculate_order.

slab_size  order  name
-------------------------------------------------
     4096      3  sgpool-128
     8192      2  kmalloc-8192
    16384      3  kmalloc-16384

kmalloc-8192's default order is smaller than sgpool-128's.

On the 4*4 tigerton machine, a similar issue appears on another
kmem_cache. Function calculate_order uses 'min_objects /= 2;' to shrink.
Combined with the size calculation/checking in slab_order, the above
issue sometimes appears. The patch below against 2.6.29-rc2 fixes it.

I checked the default orders of all kmem_caches and they don't become
smaller than before. So the patch wouldn't hurt performance.

Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>

---

diff -Nraup linux-2.6.29-rc2/mm/slub.c linux-2.6.29-rc2_slubcalc_order/mm/slub.c
--- linux-2.6.29-rc2/mm/slub.c	2009-02-11 00:49:48.000000000 -0500
+++ linux-2.6.29-rc2_slubcalc_order/mm/slub.c	2009-02-12 00:08:24.000000000 -0500
@@ -1856,6 +1856,7 @@ static inline int calculate_order(int si
 	min_objects = slub_min_objects;
 	if (!min_objects)
 		min_objects = 4 * (fls(nr_cpu_ids) + 1);
+	min_objects = min(min_objects, (PAGE_SIZE << slub_max_order)/size);
 	while (min_objects > 1) {
 		fraction = 16;
 		while (fraction >= 4) {
@@ -1865,7 +1866,7 @@ static inline int calculate_order(int si
 				return order;
 			fraction /= 2;
 		}
-		min_objects /= 2;
+		min_objects --;
 	}

 	/*

^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-02-12  5:22 ` Zhang, Yanmin
@ 2009-02-12  5:47 ` Zhang, Yanmin
  2009-02-12 15:25 ` Christoph Lameter
  2009-02-12 16:03 ` Pekka Enberg
  0 siblings, 2 replies; 93+ messages in thread
From: Zhang, Yanmin @ 2009-02-12 5:47 UTC (permalink / raw)
To: Pekka Enberg
Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin,
    Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
    chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
    suresh.b.siddha, harita.chilukuri, douglas.w.styner,
    peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
    linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

On Thu, 2009-02-12 at 13:22 +0800, Zhang, Yanmin wrote:
> On Sat, 2009-01-24 at 09:36 +0200, Pekka Enberg wrote:
> > On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
> > >> No there is another way. Increase the allocator order to 3 for the
> > >> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
> > >> larger chunks of data gotten from the page allocator. That will allow slub
> > >> to do fast allocs.
> >
> > On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
> > <yanmin_zhang@linux.intel.com> wrote:
> > > After I change kmalloc-8192/order to 3, the result (pinned netperf UDP-U-4k)
> > > difference between SLUB and SLQB becomes 1%, which can be considered as fluctuation.
> >
> > Great. We should fix calculate_order() to be order 3 for kmalloc-8192.
> > Are you interested in doing that?
> Pekka,
>
> Sorry for the late update.
> The default order of kmalloc-8192 on 2*4 stoakley is really an issue of calculate_order.

Oh, the previous patch has a compile warning. Pls. use the patch below.

From: Zhang Yanmin <yanmin.zhang@linux.intel.com>

The default order of kmalloc-8192 on the 2*4 stoakley machine is an
issue of calculate_order.

slab_size  order  name
-------------------------------------------------
     4096      3  sgpool-128
     8192      2  kmalloc-8192
    16384      3  kmalloc-16384

kmalloc-8192's default order is smaller than sgpool-128's.

On the 4*4 tigerton machine, a similar issue appears on another
kmem_cache. Function calculate_order uses 'min_objects /= 2;' to shrink.
Combined with the size calculation/checking in slab_order, the above
issue sometimes appears. The patch below against 2.6.29-rc2 fixes it.

I checked the default orders of all kmem_caches and they don't become
smaller than before. So the patch wouldn't hurt performance.

Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>

---

--- linux-2.6.29-rc2/mm/slub.c	2009-02-11 00:49:48.000000000 -0500
+++ linux-2.6.29-rc2_slubcalc_order/mm/slub.c	2009-02-12 00:47:52.000000000 -0500
@@ -1844,6 +1844,7 @@ static inline int calculate_order(int si
 	int order;
 	int min_objects;
 	int fraction;
+	int max_objects;

 	/*
 	 * Attempt to find best configuration for a slab. This
@@ -1856,6 +1857,9 @@ static inline int calculate_order(int si
 	min_objects = slub_min_objects;
 	if (!min_objects)
 		min_objects = 4 * (fls(nr_cpu_ids) + 1);
+	max_objects = (PAGE_SIZE << slub_max_order)/size;
+	min_objects = min(min_objects, max_objects);
+
 	while (min_objects > 1) {
 		fraction = 16;
 		while (fraction >= 4) {
@@ -1865,7 +1869,7 @@ static inline int calculate_order(int si
 				return order;
 			fraction /= 2;
 		}
-		min_objects /= 2;
+		min_objects --;
 	}

 	/*

^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update
  2009-02-12  5:47 ` Zhang, Yanmin
@ 2009-02-12 15:25 ` Christoph Lameter
  2009-02-12 16:07 ` Pekka Enberg
  2009-02-12 16:03 ` Pekka Enberg
  1 sibling, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2009-02-12 15:25 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin,
    Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox,
    chinang.ma, linux-kernel, sharad.c.tripathi, arjan,
    suresh.b.siddha, harita.chilukuri, douglas.w.styner,
    peter.xihong.wang, hubert.nueckel, chris.mason, srostedt,
    linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar

[-- Attachment #1: Type: TEXT/PLAIN, Size: 679 bytes --]

On Thu, 12 Feb 2009, Zhang, Yanmin wrote:

> The default order of kmalloc-8192 on 2*4 stoakley is an issue of calculate_order.
>
> slab_size  order  name
> -------------------------------------------------
>      4096      3  sgpool-128
>      8192      2  kmalloc-8192
>     16384      3  kmalloc-16384
>
> kmalloc-8192's default order is smaller than sgpool-128's.

You reverted the page allocator passthrough patch before this, right?
Otherwise kmalloc-8192 should not exist and allocation calls for 8192
bytes would be converted inline to a request for an order-1 page from
the page allocator.

^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-02-12 15:25 ` Christoph Lameter @ 2009-02-12 16:07 ` Pekka Enberg 0 siblings, 0 replies; 93+ messages in thread From: Pekka Enberg @ 2009-02-12 16:07 UTC (permalink / raw) To: Christoph Lameter Cc: Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar Hi Christoph, On Thu, 12 Feb 2009, Zhang, Yanmin wrote: >> The default order of kmalloc-8192 on 2*4 stoakley is an issue of calculate_order. >> >> >> slab_size order name >> ------------------------------------------------- >> 4096 3 sgpool-128 >> 8192 2 kmalloc-8192 >> 16384 3 kmalloc-16384 >> >> kmalloc-8192's default order is smaller than sgpool-128's. On Thu, Feb 12, 2009 at 5:25 PM, Christoph Lameter <cl@linux-foundation.org> wrote: > You reverted the page allocator passthrough patch before this right? > Otherwise kmalloc-8192 should not exist and allocation calls for 8192 > bytes would be converted inline to request of an order 1 page from the > page allocator. Yup, I assume that's the case here. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-02-12 5:47 ` Zhang, Yanmin 2009-02-12 15:25 ` Christoph Lameter @ 2009-02-12 16:03 ` Pekka Enberg 1 sibling, 0 replies; 93+ messages in thread From: Pekka Enberg @ 2009-02-12 16:03 UTC (permalink / raw) To: Zhang, Yanmin Cc: Christoph Lameter, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Sat, 2009-01-24 at 09:36 +0200, Pekka Enberg wrote: > > > On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote: > > > >> No there is another way. Increase the allocator order to 3 for the > > > >> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the > > > >> larger chunks of data gotten from the page allocator. That will allow slub > > > >> to do fast allocs. > > > > > > On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin > > > <yanmin_zhang@linux.intel.com> wrote: > > > > After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k) > > > > difference between SLUB and SLQB becomes 1% which can be considered as fluctuation. > > > > > > Great. We should fix calculate_order() to be order 3 for kmalloc-8192. > > > Are you interested in doing that? On Thu, 2009-02-12 at 13:22 +0800, Zhang, Yanmin wrote: > > Pekka, > > > > Sorry for the late update. > > The default order of kmalloc-8192 on 2*4 stoakley is really an issue of calculate_order. On Thu, 2009-02-12 at 13:47 +0800, Zhang, Yanmin wrote: > Oh, previous patch has a compiling warning. Pls. use below patch. > > From: Zhang Yanmin <yanmin.zhang@linux.intel.com> > > The default order of kmalloc-8192 on 2*4 stoakley is an issue of calculate_order. Applied to the 'topic/slub/perf' branch. Thanks! Pekka ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-24 2:55 ` Zhang, Yanmin 2009-01-24 7:36 ` Pekka Enberg @ 2009-01-26 17:36 ` Christoph Lameter 2009-02-01 2:52 ` Zhang, Yanmin 1 sibling, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2009-01-26 17:36 UTC (permalink / raw) To: Zhang, Yanmin Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Sat, 24 Jan 2009, Zhang, Yanmin wrote: > But when trying to increased it to 4, I got: > [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order > [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order > -bash: echo: write error: Invalid argument This is because 4 is more than the maximum allowed order. You can reconfigure that by setting slub_max_order=5 or so on boot. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-26 17:36 ` Christoph Lameter @ 2009-02-01 2:52 ` Zhang, Yanmin 0 siblings, 0 replies; 93+ messages in thread From: Zhang, Yanmin @ 2009-02-01 2:52 UTC (permalink / raw) To: Christoph Lameter Cc: Pekka Enberg, Andi Kleen, Matthew Wilcox, Nick Piggin, Andrew Morton, netdev, Stephen Rothwell, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty, Ingo Molnar On Mon, 2009-01-26 at 12:36 -0500, Christoph Lameter wrote: > On Sat, 24 Jan 2009, Zhang, Yanmin wrote: > > > But when trying to increased it to 4, I got: > > [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order > > [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order > > -bash: echo: write error: Invalid argument > > This is because 4 is more than the maximum allowed order. You can > reconfigure that by setting > > slub_max_order=5 > > or so on boot. With slub_max_order=5, the default order of kmalloc-8192 becomes 5. I tested it with netperf UDP-U-4k and the result difference from SLAB/SLQB is less than 1% which is really fluctuation. ^ permalink raw reply [flat|nested] 93+ messages in thread
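Putting the two knobs from this subthread together, the tuning sequence looks like the following (a sketch: the sysfs path assumes CONFIG_SLUB and root access, as in the echo commands shown above):

```sh
# At boot, raise the cap first -- orders above slub_max_order are
# rejected with EINVAL at runtime:
#     kernel command line: ... slub_max_order=5
# Then the per-cache order can be raised up to that cap:
cd /sys/kernel/slab
echo 5 > kmalloc-8192/order
cat kmalloc-8192/order        # verify the new order took effect
```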
* Re: Mainline kernel OLTP performance update
  2009-01-23  3:02 ` Zhang, Yanmin
  2009-01-23  6:52 ` Pekka Enberg
@ 2009-01-23  8:33 ` Nick Piggin
  2009-01-23  9:02 ` Zhang, Yanmin
  1 sibling, 1 reply; 93+ messages in thread
From: Nick Piggin @ 2009-01-23 8:33 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox,
    Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma,
    linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha,
    harita.chilukuri, douglas.w.styner, peter.xihong.wang,
    hubert.nueckel, chris.mason, srostedt, linux-scsi,
    andrew.vasquez, anirban.chakraborty

On Friday 23 January 2009 14:02:53 Zhang, Yanmin wrote:

> 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better
> than SLQB's;

I'll have to look into this too. Could be evidence of the possible
TLB improvement from using bigger pages and/or page-specific freelist,
I suppose.

Do you have a script used to start netperf in that configuration?

^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-23 8:33 ` Nick Piggin @ 2009-01-23 9:02 ` Zhang, Yanmin 2009-01-23 18:40 ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones 0 siblings, 1 reply; 93+ messages in thread From: Zhang, Yanmin @ 2009-01-23 9:02 UTC (permalink / raw) To: Nick Piggin Cc: Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty [-- Attachment #1: Type: text/plain, Size: 622 bytes --] On Fri, 2009-01-23 at 19:33 +1100, Nick Piggin wrote: > On Friday 23 January 2009 14:02:53 Zhang, Yanmin wrote: > > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better > > than SLQB's; > > I'll have to look into this too. Could be evidence of the possible > TLB improvement from using bigger pages and/or page-specific freelist, > I suppose. > > Do you have a scripted used to start netperf in that configuration? See the attachment. Steps to run testing: 1) compile netperf; 2) Change PROG_DIR to path/to/netperf/src; 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus. [-- Attachment #2: start_netperf_udp_v4.sh --] [-- Type: application/x-shellscript, Size: 1361 bytes --] ^ permalink raw reply [flat|nested] 93+ messages in thread
* care and feeding of netperf (Re: Mainline kernel OLTP performance update) 2009-01-23 9:02 ` Zhang, Yanmin @ 2009-01-23 18:40 ` Rick Jones 2009-01-23 18:51 ` Grant Grundler 2009-01-24 3:03 ` Zhang, Yanmin 0 siblings, 2 replies; 93+ messages in thread From: Rick Jones @ 2009-01-23 18:40 UTC (permalink / raw) To: Zhang, Yanmin Cc: Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty > 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus. Some comments on the script: > #!/bin/sh > > PROG_DIR=/home/ymzhang/test/netperf/src > date=`date +%H%M%N` > #PROG_DIR=/root/netperf/netperf/src > client_num=$1 > pin_cpu=$2 > > start_port_server=12384 > start_port_client=15888 > > killall netserver > ${PROG_DIR}/netserver > sleep 2 Any particular reason for killing-off the netserver daemon? > if [ ! 
-d result ]; then
>     mkdir result
> fi
>
> all_result_files=""
> for i in `seq 1 ${client_num}`; do
>     if [ "${pin_cpu}" == "pin" ]; then
>         pin_param="-T ${i} ${i}"

The -T option takes arguments of the form:

N   - bind both netperf and netserver to core N
N,  - bind only netperf to core N, float netserver
,M  - float netperf, bind only netserver to core M
N,M - bind netperf to core N and netserver to core M

Without a comma between N and M knuth only knows what the command line
parser will do :)

>     fi
>     result_file=result/netperf_${start_port_client}.${date}
> #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096
> #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384 12888 -s 32768 -S 32768 -m 4096
> #${PROG_DIR}/netperf -p ${port_num} -t TCP_RR -l 60 -H 127.0.0.1 ${pin_param} -- -r 1,1 >${result_file} &
>     ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- -P ${start_port_client} ${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file} &

Same thing here for the -P option - there needs to be a comma between
the two port numbers; otherwise, the best case is that the second port
number is ignored. Worst case is that netperf starts doing knuth only
knows what.

To get quick profiles, that form of aggregate netperf is OK - just the
one iteration with background processes using a moderately long run
time. However, for result reporting, it is best to (ab)use the
confidence intervals functionality to try to avoid skew errors. I tend
to add-in a global -i 30 option to get each netperf to repeat its
measurements 30 times. That way one is reasonably confident that skew
issues are minimized.

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance

And I would probably add the -c and -C options to have netperf report
service demands.
> sub_pid="${sub_pid} `echo $!`" > port_num=$((${port_num}+1)) > all_result_files="${all_result_files} ${result_file}" > start_port_server=$((${start_port_server}+1)) > start_port_client=$((${start_port_client}+1)) > done; > > wait ${sub_pid} > killall netserver > > result="0" > for i in `echo ${all_result_files}`; do > sub_result=`awk '/Throughput/ {getline; getline; getline; print " "$6}' ${i}` > result=`echo "${result}+${sub_result}"|bc` > done; The documented-only-in-source :( "omni" tests in top-of-trunk netperf: http://www.netperf.org/svn/netperf2/trunk ./configure --enable-omni allow one to specify which result values one wants, in which order, either as more or less traditional netperf output (test-specific -O), CSV (test-specific -o) or keyval (test-specific -k). All three take an optional filename as an argument with the file containing a list of desired output values. You can give a "filename" of '?' to get the list of output values known to that version of netperf. Might help simplify parsing and whatnot. happy benchmarking, rick jones > > echo $result > ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: care and feeding of netperf (Re: Mainline kernel OLTP performance update) 2009-01-23 18:40 ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones @ 2009-01-23 18:51 ` Grant Grundler 2009-01-24 3:03 ` Zhang, Yanmin 1 sibling, 0 replies; 93+ messages in thread From: Grant Grundler @ 2009-01-23 18:51 UTC (permalink / raw) To: Rick Jones Cc: Zhang, Yanmin, Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Fri, Jan 23, 2009 at 10:40 AM, Rick Jones <rick.jones2@hp.com> wrote: ... > And I would probably add the -c and -C options to have netperf report > service demands. For performance analysis, the service demand is often more interesting than the absolute performance (which typically only varies a few Mb/s for gigE NICs). I strongly encourage adding -c and -C. grant ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: care and feeding of netperf (Re: Mainline kernel OLTP performance update) 2009-01-23 18:40 ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones 2009-01-23 18:51 ` Grant Grundler @ 2009-01-24 3:03 ` Zhang, Yanmin 2009-01-26 18:26 ` Rick Jones 1 sibling, 1 reply; 93+ messages in thread From: Zhang, Yanmin @ 2009-01-24 3:03 UTC (permalink / raw) To: Rick Jones Cc: Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Fri, 2009-01-23 at 10:40 -0800, Rick Jones wrote: > > 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus. > > Some comments on the script: Thanks. I wanted to run the testing to get result quickly as long as the result has no big fluctuation. > > > #!/bin/sh > > > > PROG_DIR=/home/ymzhang/test/netperf/src > > date=`date +%H%M%N` > > #PROG_DIR=/root/netperf/netperf/src > > client_num=$1 > > pin_cpu=$2 > > > > start_port_server=12384 > > start_port_client=15888 > > > > killall netserver > > ${PROG_DIR}/netserver > > sleep 2 > > Any particular reason for killing-off the netserver daemon? I'm not sure if prior running might leave any impact on later running, so just kill netserver. > > > if [ ! 
-d result ]; then > > mkdir result > > fi > > > > all_result_files="" > > for i in `seq 1 ${client_num}`; do > > if [ "${pin_cpu}" == "pin" ]; then > > pin_param="-T ${i} ${i}" > > The -T option takes arguments of the form: > > N - bind both netperf and netserver to core N > N, - bind only netperf to core N, float netserver > ,M - float netperf, bind only netserver to core M > N,M - bind netperf to core N and netserver to core M > > Without a comma between N and M knuth only knows what the command line parser > will do :) > > > fi > > result_file=result/netperf_${start_port_client}.${date} > > #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096 > > #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384 12888 -s 32768 -S 32768 -m 4096 > > #${PROG_DIR}/netperf -p ${port_num} -t TCP_RR -l 60 -H 127.0.0.1 ${pin_param} -- -r 1,1 >${result_file} & > > ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- -P ${start_port_client} ${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file} & > > Same thing here for the -P option - there needs to be a comma between the two > port numbers otherwise, the best case is that the second port number is ignored. > Worst case is that netperf starts doing knuth only knows what. Thanks. > > > To get quick profiles, that form of aggregate netperf is OK - just the one > iteration with background processes using a moderatly long run time. However, > for result reporting, it is best to (ab)use the confidence intervals > functionality to try to avoid skew errors. Yes. My formal testing uses -i 50. I just wanted a quick testing. If I need finer-tuning or investigation, I would turn on more options. > I tend to add-in a global -i 30 > option to get each netperf to repeat its measurments 30 times. That way one is > reasonably confident that skew issues are minimized. 
> > http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance > > And I would probably add the -c and -C options to have netperf report service > demands. Yes. That's good. I'm used to start vmstat or mpstat to monitor cpu utilization in real time. > > > > sub_pid="${sub_pid} `echo $!`" > > port_num=$((${port_num}+1)) > > all_result_files="${all_result_files} ${result_file}" > > start_port_server=$((${start_port_server}+1)) > > start_port_client=$((${start_port_client}+1)) > > done; > > > > wait ${sub_pid} > > killall netserver > > > > result="0" > > for i in `echo ${all_result_files}`; do > > sub_result=`awk '/Throughput/ {getline; getline; getline; print " "$6}' ${i}` > > result=`echo "${result}+${sub_result}"|bc` > > done; > > The documented-only-in-source :( "omni" tests in top-of-trunk netperf: > > http://www.netperf.org/svn/netperf2/trunk > > ./configure --enable-omni > > allow one to specify which result values one wants, in which order, either as > more or less traditional netperf output (test-specific -O), CSV (test-specific > -o) or keyval (test-specific -k). All three take an optional filename as an > argument with the file containing a list of desired output values. You can give > a "filename" of '?' to get the list of output values known to that version of > netperf. > > Might help simplify parsing and whatnot. Yes, it does. > > happy benchmarking, > > rick jones Thanks again. I learned a lot. > > > > > echo $result > > > ^ permalink raw reply [flat|nested] 93+ messages in thread
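Once the omni tests' keyval output (test-specific -k) is in use, the awk getline chain plus bc pipeline at the end of the script collapses to a single awk pass. A sketch, with the caveat that the THROUGHPUT key name is an assumption to check against the '?' output of your netperf build; the result files here are synthesized so the sketch is self-contained:

```shell
#!/bin/sh
# Sketch: summing per-instance throughput from keyval (-k) result files.
# THROUGHPUT is an assumed key name -- verify it against your netperf.
# Demo files are synthesized so this runs without netperf installed.
dir=$(mktemp -d)
printf 'THROUGHPUT=100.50\n' > "${dir}/netperf_15888"
printf 'THROUGHPUT=200.25\n' > "${dir}/netperf_15889"

# One pass over all result files: split on '=', accumulate the values.
total=$(awk -F= '/^THROUGHPUT=/ { sum += $2 } END { printf "%.2f", sum }' \
    "${dir}"/netperf_*)
echo "${total}"    # 300.75

rm -r "${dir}"
```

This also avoids depending on the exact position of the throughput field in the human-readable banner, which is what makes the getline-based parsing fragile.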
* Re: care and feeding of netperf (Re: Mainline kernel OLTP performance update) 2009-01-24 3:03 ` Zhang, Yanmin @ 2009-01-26 18:26 ` Rick Jones 0 siblings, 0 replies; 93+ messages in thread From: Rick Jones @ 2009-01-26 18:26 UTC (permalink / raw) To: Zhang, Yanmin Cc: Nick Piggin, Pekka Enberg, Christoph Lameter, Andi Kleen, Matthew Wilcox, Andrew Morton, netdev, sfr, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty >>To get quick profiles, that form of aggregate netperf is OK - just the one >>iteration with background processes using a moderatly long run time. However, >>for result reporting, it is best to (ab)use the confidence intervals >>functionality to try to avoid skew errors. > > Yes. My formal testing uses -i 50. I just wanted a quick testing. If I need > finer-tuning or investigation, I would turn on more options. Netperf will silently clip that to 30 as that is all the built-in tables know. > Thanks again. I learned a lot. Feel free to wander over to netperf-talk over at netperf.org if you want to talk some more about the care and feeding of netperf. happy benchmarking, rick jones ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 6:46 ` Nick Piggin 2009-01-16 6:55 ` Matthew Wilcox @ 2009-01-16 7:00 ` Andrew Morton 2009-01-16 7:25 ` Nick Piggin 2009-01-16 8:59 ` Nick Piggin 2009-01-16 18:11 ` Rick Jones 2 siblings, 2 replies; 93+ messages in thread From: Andrew Morton @ 2009-01-16 7:00 UTC (permalink / raw) To: Nick Piggin Cc: netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > On Friday 16 January 2009 15:12:10 Andrew Morton wrote: > > On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin <nickpiggin@yahoo.com.au> > wrote: > > > I would like to see SLQB merged in mainline, made default, and wait for > > > some number releases. Then we take what we know, and try to make an > > > informed decision about the best one to take. I guess that is problematic > > > in that the rest of the kernel is moving underneath us. Do you have > > > another idea? > > > > Nope. If it doesn't work out, we can remove it again I guess. > > OK, I have these numbers to show I'm not completely off my rocker to suggest > we merge SLQB :) Given these results, how about I ask to merge SLQB as default > in linux-next, then if nothing catastrophic happens, merge it upstream in the > next merge window, then a couple of releases after that, given some time to > test and tweak SLQB, then we plan to bite the bullet and emerge with just one > main slab allocator (plus SLOB). That's a plan. > SLQB tends to be the winner here. Can you think of anything with which it will be the loser? ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 7:00 ` Mainline kernel OLTP performance update Andrew Morton @ 2009-01-16 7:25 ` Nick Piggin 2009-01-16 8:59 ` Nick Piggin 1 sibling, 0 replies; 93+ messages in thread From: Nick Piggin @ 2009-01-16 7:25 UTC (permalink / raw) To: Andrew Morton Cc: netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Friday 16 January 2009 18:00:43 Andrew Morton wrote: > On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > On Friday 16 January 2009 15:12:10 Andrew Morton wrote: > > > On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin > > > <nickpiggin@yahoo.com.au> > > > > wrote: > > > > I would like to see SLQB merged in mainline, made default, and wait > > > > for some number releases. Then we take what we know, and try to make > > > > an informed decision about the best one to take. I guess that is > > > > problematic in that the rest of the kernel is moving underneath us. > > > > Do you have another idea? > > > > > > Nope. If it doesn't work out, we can remove it again I guess. > > > > OK, I have these numbers to show I'm not completely off my rocker to > > suggest we merge SLQB :) Given these results, how about I ask to merge > > SLQB as default in linux-next, then if nothing catastrophic happens, > > merge it upstream in the next merge window, then a couple of releases > > after that, given some time to test and tweak SLQB, then we plan to bite > > the bullet and emerge with just one main slab allocator (plus SLOB). > > That's a plan. > > > SLQB tends to be the winner here. > > Can you think of anything with which it will be the loser? Well, that fio test showed it was behind SLAB. I just discovered that yesterday during running these tests, so I'll take a look at that. 
The Intel performance guys I think have one or two cases where it is slower. They don't seem to be too serious, and tend to be specific to some machines (eg. the same test with a different CPU architecture turns out to be faster). So I'll be looking into these things, but I haven't seen anything too serious yet. I'm mostly interested in macro benchmarks and more real world workloads.

At a higher level, SLAB has some interesting features. It basically has "crossbars" of queues, which provide queues for allocating and freeing to and from different CPUs and nodes. This is what bloats up the kmem_cache data structures to tens or hundreds of gigabytes each on SGI size systems. But it also has good properties. On smaller multiprocessor and NUMA systems, it might be the case that SLAB does better in workloads that involve objects being allocated on one CPU and freed on another. I haven't actually observed problems here, but I don't have a lot of good tests.

SLAB is also fundamentally different from SLUB and SLQB in that it uses arrays to store pointers to objects in its queues, rather than having a linked list using pointers embedded in the objects. This might in some cases make it easier to prefetch objects in parallel with finding the object itself. I haven't actually been able to attribute a particular regression to this interesting difference, but it might turn up as an issue.

These are two big differences between SLAB and SLQB. Linked lists of objects were used in favour of arrays because of the memory overhead, to allow better tuning of the queue sizes, and to reduce the overhead of copying around arrays of pointers (SLQB can just copy the head of one list to the tail of another in order to move objects around); they also eliminate the need for additional metadata beyond the struct page for each slab. The crossbars of queues were removed because of the bloating and memory overhead issues. 
The fact that we now have linked lists helps a little bit with this, because moving lists of objects around gets a bit easier. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 7:00 ` Mainline kernel OLTP performance update Andrew Morton 2009-01-16 7:25 ` Nick Piggin @ 2009-01-16 8:59 ` Nick Piggin 1 sibling, 0 replies; 93+ messages in thread From: Nick Piggin @ 2009-01-16 8:59 UTC (permalink / raw) To: Andrew Morton Cc: netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty

On Friday 16 January 2009 18:00:43 Andrew Morton wrote:
> On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <nickpiggin@yahoo.com.au>
> > SLQB tends to be the winner here.
>
> Can you think of anything with which it will be the loser?

Here are some more performance numbers with "slub_test" kernel module. It's basically a really tiny microbenchmark, so I don't really consider it gives too useful results, except it does show up some problems in SLAB's scalability that may start to bite as we continue to get more threads per socket. (I ran a few of these tests on one of Dave's 2 socket, 128 thread systems, and slab gets really painful... these kinds of thread counts may only be a couple of years away from x86).

All numbers are in CPU cycles.

Single thread testing
=====================

1. Kmalloc: Repeatedly allocate 10000 objs then free them

obj size        SLAB        SLQB        SLUB
       8     77+ 128     69+  47     61+  77
      16     69+ 104    116+  70     77+  80
      32     66+ 101     82+  81     71+  89
      64     82+ 116     95+  81     94+ 105
     128    100+ 148    106+  94    114+ 163
     256    153+ 136    134+  98    124+ 186
     512    209+ 161    170+ 186    134+ 276
    1024    331+ 249    236+ 245    134+ 283
    2048    608+ 443    380+ 386    172+ 312
    4096   1109+ 624    678+ 661    239+ 372
    8192   1166+1077    767+ 683    535+ 433
   16384   1213+1160    914+ 731    577+ 682

We can see SLAB has a fair bit more overhead in this case. SLUB starts doing higher order allocations I think around size 256, which reduces costs there. Don't know what the SLQB artifact at 16 is caused by...

2. 
Kmalloc: alloc/free test (repeatedly allocate and free)

 size   SLAB   SLQB   SLUB
    8     98     90     94
   16     98     90     93
   32     98     90     93
   64     99     90     94
  128    100     92     93
  256    104     93     95
  512    105     94     97
 1024    106     93     97
 2048    107     95     95
 4096    111     92     97
 8192    111     94    631
16384    114     92    741

Here we see SLUB's allocator passthrough (or is it the lack of queueing?). Straight line speed at small sizes is probably due to instructions in the fastpaths. It's pretty meaningless though because it probably changes if there is any actual load on the CPU, or another CPU architecture. Doesn't look bad for SLQB though :)

Concurrent allocs
=================

1. Like the first single thread test, lots of allocs, then lots of frees. But running on all CPUs. Average over all CPUs.

 size         SLAB        SLQB       SLUB
    8     251+ 322     73+  47    65+  76
   16     240+ 331     84+  53    67+  82
   32     235+ 316     94+  57    77+  92
   64     338+ 303    120+  66   105+ 136
  128     549+ 355    139+ 166   127+ 344
  256    1129+ 456    189+ 178   236+ 404
  512    2085+ 872    240+ 217   244+ 419
 1024    3895+1373    347+ 333   251+ 440
 2048    7725+2579    616+ 695   373+ 588
 4096   15320+4534   1245+1442   689+1002

A problem with SLAB scalability starts showing up on this system with only 4 threads per socket. Again, SLUB sees a benefit from higher order allocations.

2. Same as 2nd single threaded test, alloc then free, on all CPUs.

 size   SLAB   SLQB   SLUB
    8     99     90     93
   16     99     90     93
   32     99     90     93
   64    100     91     94
  128    102     90     93
  256    105     94     97
  512    106     93     97
 1024    108     93     97
 2048    109     93     96
 4096    110     93     96

No surprises. Objects always fit in queues (or unqueues, in the case of SLUB), so there is no cross cache traffic.

Remote free test
================

1. Allocate N objects on CPUs 1-7, then free them all from CPU 0. Average cost of all kmalloc+kfree

 size         SLAB       SLQB      SLUB
    8     191+ 142    53+  64    56+ 99
   16     180+ 141    82+  69    60+117
   32     173+ 142   100+  71    78+151
   64     240+ 147   131+  73   117+216
  128     441+ 162   158+ 114   114+251
  256     833+ 181   179+ 119   185+263
  512    1546+ 243   220+ 132   194+292
 1024    2886+ 341   299+ 135   201+312
 2048    5737+ 577   517+ 139   291+370
 4096   11288+1201   976+ 153   528+482

2. 
All CPUs allocate objects on CPU N, which are then freed by CPU N+1 % NR_CPUS (ie. CPU1 frees objects allocated by CPU0).

 size         SLAB      SLQB       SLUB
    8     236+ 331    72+123    64+ 114
   16     232+ 345    80+125    71+ 139
   32     227+ 342    85+134    82+ 183
   64     324+ 336   140+138   111+ 219
  128     569+ 384   245+201   145+ 337
  256    1111+ 448   243+222   238+ 447
  512    2091+ 871   249+244   247+ 470
 1024    3923+1593   254+256   254+ 503
 2048    7700+2968   273+277   369+ 699
 4096   15154+5061   310+323   693+1220

SLAB's concurrent allocation bottlenecks show up again in these tests. Unfortunately these are not too realistic tests of the remote freeing pattern, because normally you would expect remote freeing and allocation happening concurrently, rather than all allocations up front, then all frees. If the test behaved like that, then objects could probably fit in SLAB's queues and it might see some good numbers. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 6:46 ` Nick Piggin 2009-01-16 6:55 ` Matthew Wilcox 2009-01-16 7:00 ` Mainline kernel OLTP performance update Andrew Morton @ 2009-01-16 18:11 ` Rick Jones 2009-01-19 7:43 ` Nick Piggin 2 siblings, 1 reply; 93+ messages in thread From: Rick Jones @ 2009-01-16 18:11 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty Nick Piggin wrote: > OK, I have these numbers to show I'm not completely off my rocker to suggest > we merge SLQB :) Given these results, how about I ask to merge SLQB as default > in linux-next, then if nothing catastrophic happens, merge it upstream in the > next merge window, then a couple of releases after that, given some time to > test and tweak SLQB, then we plan to bite the bullet and emerge with just one > main slab allocator (plus SLOB). > > > System is a 2socket, 4 core AMD. Not exactly a large system :) Barely NUMA even with just two sockets. > All debug and stats options turned off for > all the allocators; default parameters (ie. SLUB using higher order pages, > and the others tend to be using order-0). SLQB is the version I recently > posted, with some of the prefetching removed according to Pekka's review > (probably a good idea to only add things like that in if/when they prove to > be an improvement). > > ... 
>
> Netperf UDP unidirectional send test (10 runs, higher better):
>
> Server and client bound to same CPU
> SLAB AVG=60.111 STD=1.59382
> SLQB AVG=60.167 STD=0.685347
> SLUB AVG=58.277 STD=0.788328
>
> Server and client bound to same socket, different CPUs
> SLAB AVG=85.938 STD=0.875794
> SLQB AVG=93.662 STD=2.07434
> SLUB AVG=81.983 STD=0.864362
>
> Server and client bound to different sockets
> SLAB AVG=78.801 STD=1.44118
> SLQB AVG=78.269 STD=1.10457
> SLUB AVG=71.334 STD=1.16809
>
> ...
>
> I haven't done any non-local network tests. Networking is one of the
> subsystems most heavily dependent on slab performance, so if anybody
> cares to run their favourite tests, that would be really helpful.

I'm guessing, but then are these Mbit/s figures? Would that be the sending throughput or the receiving throughput?

I love to see netperf used, but why UDP and loopback? Also, how about the service demands?

rick jones ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 18:11 ` Rick Jones @ 2009-01-19 7:43 ` Nick Piggin 2009-01-19 22:19 ` Rick Jones 0 siblings, 1 reply; 93+ messages in thread From: Nick Piggin @ 2009-01-19 7:43 UTC (permalink / raw) To: Rick Jones Cc: Andrew Morton, netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty On Saturday 17 January 2009 05:11:02 Rick Jones wrote: > Nick Piggin wrote: > > OK, I have these numbers to show I'm not completely off my rocker to > > suggest we merge SLQB :) Given these results, how about I ask to merge > > SLQB as default in linux-next, then if nothing catastrophic happens, > > merge it upstream in the next merge window, then a couple of releases > > after that, given some time to test and tweak SLQB, then we plan to bite > > the bullet and emerge with just one main slab allocator (plus SLOB). > > > > > > System is a 2socket, 4 core AMD. > > Not exactly a large system :) Barely NUMA even with just two sockets. You're right ;) But at least it is exercising the NUMA paths in the allocator, and represents a pretty common size of system... I can run some tests on bigger systems at SUSE, but it is not always easy to set up "real" meaningful workloads on them or configure significant IO for them. > > Netperf UDP unidirectional send test (10 runs, higher better): > > > > Server and client bound to same CPU > > SLAB AVG=60.111 STD=1.59382 > > SLQB AVG=60.167 STD=0.685347 > > SLUB AVG=58.277 STD=0.788328 > > > > Server and client bound to same socket, different CPUs > > SLAB AVG=85.938 STD=0.875794 > > SLQB AVG=93.662 STD=2.07434 > > SLUB AVG=81.983 STD=0.864362 > > > > Server and client bound to different sockets > > SLAB AVG=78.801 STD=1.44118 > > SLQB AVG=78.269 STD=1.10457 > > SLUB AVG=71.334 STD=1.16809 > > > > ... 
> > > > I haven't done any non-local network tests. Networking is the one of the > > subsystems most heavily dependent on slab performance, so if anybody > > cares to run their favourite tests, that would be really helpful. > > I'm guessing, but then are these Mbit/s figures? Would that be the sending > throughput or the receiving throughput? Yes, Mbit/s. They were... hmm, sending throughput I think, but each pair of numbers seemed to be identical IIRC? > I love to see netperf used, but why UDP and loopback? No really good reason. I guess I was hoping to keep other variables as small as possible. But I guess a real remote test would be a lot more realistic as a networking test. Hmm, but I could probably set up a test over a simple GbE link here. I'll try that. > Also, how about the > service demands? Well, over loopback and using CPU binding, I was hoping it wouldn't change much... but I see netperf does some measurements for you. I will consider those in future too. BTW. is it possible to do parallel netperf tests? ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-19 7:43 ` Nick Piggin @ 2009-01-19 22:19 ` Rick Jones 0 siblings, 0 replies; 93+ messages in thread From: Rick Jones @ 2009-01-19 22:19 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, netdev, sfr, matthew, matthew.r.wilcox, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, andrew.vasquez, anirban.chakraborty >>>System is a 2socket, 4 core AMD. >> >>Not exactly a large system :) Barely NUMA even with just two sockets. > > > You're right ;) > > But at least it is exercising the NUMA paths in the allocator, and > represents a pretty common size of system... > > I can run some tests on bigger systems at SUSE, but it is not always > easy to set up "real" meaningful workloads on them or configure > significant IO for them. Not sure if I know enough git to pull your trees, or if this cobbler's child will have much in the way of bigger systems, but there is a chance I might - contact me offline with some pointers on how to pull and build the bits and such. >>>Netperf UDP unidirectional send test (10 runs, higher better): >>> >>>Server and client bound to same CPU >>>SLAB AVG=60.111 STD=1.59382 >>>SLQB AVG=60.167 STD=0.685347 >>>SLUB AVG=58.277 STD=0.788328 >>> >>>Server and client bound to same socket, different CPUs >>>SLAB AVG=85.938 STD=0.875794 >>>SLQB AVG=93.662 STD=2.07434 >>>SLUB AVG=81.983 STD=0.864362 >>> >>>Server and client bound to different sockets >>>SLAB AVG=78.801 STD=1.44118 >>>SLQB AVG=78.269 STD=1.10457 >>>SLUB AVG=71.334 STD=1.16809 >>> >> >> > ... >> >>>I haven't done any non-local network tests. Networking is the one of the >>>subsystems most heavily dependent on slab performance, so if anybody >>>cares to run their favourite tests, that would be really helpful. >> >>I'm guessing, but then are these Mbit/s figures? 
>>Would that be the sending throughput or the receiving throughput?
>
> Yes, Mbit/s. They were... hmm, sending throughput I think, but each pair
> of numbers seemed to be identical IIRC?

Mega *bits* per second? And those were 4K sends right? That seems rather low for loopback - I would have expected nearly two orders of magnitude more. I wonder if the intra-stack flow control kicked-in? You might try adding test specific -S and -s options to set much larger socket buffers to try to avoid that. Or simply use TCP.

netperf -H <foo> ... -- -s 1M -S 1M -m 4K

>>I love to see netperf used, but why UDP and loopback?
>
> No really good reason. I guess I was hoping to keep other variables as
> small as possible. But I guess a real remote test would be a lot more
> realistic as a networking test. Hmm, but I could probably set up a test
> over a simple GbE link here. I'll try that.

If bandwidth is an issue - that is to say, one saturates the link before much of anything "interesting" happens in the host - you can use something like aggregate TCP_RR: ./configure with --enable-burst and then something like

netperf -H <remote> -t TCP_RR -- -D -b 32

and it will have as many as 33 discrete transactions in flight at one time on the one connection. The -D is there to set TCP_NODELAY to preclude TCP chunking the single-byte (default, take your pick of a more reasonable size) transactions into one segment.

>>Also, how about the service demands?
>
> Well, over loopback and using CPU binding, I was hoping it wouldn't
> change much...

Hope... but verify :)

> but I see netperf does some measurements for you. I
> will consider those in future too.
>
> BTW. is it possible to do parallel netperf tests?

Yes, by (ab)using the confidence intervals code. Poke around in http://www.netperf.org/svn/netperf2/doc/netperf.html in the "Aggregates" section, and I can go into further details offline (or here if folks want to see the discussion). 
rick jones ^ permalink raw reply [flat|nested] 93+ messages in thread
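Rick's aggregate TCP_RR recipe can be sketched as a small driver that launches one bursty netperf per CPU. This is a hedged sketch, not his tooling: REMOTE and the DRY_RUN guard (which prints the command lines instead of running them) are illustrative assumptions, and -b requires a netperf built with --enable-burst.

```shell
#!/bin/sh
# Sketch: aggregate TCP_RR, one bursty netperf per CPU. -b needs a
# netperf built with --enable-burst; -D sets TCP_NODELAY so the
# single-byte transactions are not coalesced into one segment.
# REMOTE and DRY_RUN are assumptions for illustration.
REMOTE=${REMOTE:-192.168.1.2}
NPROCS=${NPROCS:-4}
DRY_RUN=${DRY_RUN:-1}

launched=0
n=0
while [ "${n}" -lt "${NPROCS}" ]; do
    cmd="netperf -H ${REMOTE} -t TCP_RR -l 60 -T ${n},${n} -- -D -b 32 -r 1,1"
    if [ "${DRY_RUN}" = "1" ]; then
        echo "${cmd}"             # print instead of running
    else
        ${cmd} > "tcp_rr.${n}" &  # one background instance per CPU
    fi
    launched=$((launched + 1))
    n=$((n + 1))
done
[ "${DRY_RUN}" = "1" ] || wait    # reap the background instances
```

With -b 32 each connection keeps up to 33 transactions in flight, so the host does interesting work well before the link saturates.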
* Re: Mainline kernel OLTP performance update 2009-01-15 2:04 ` Andrew Morton ` (2 preceding siblings ...) 2009-01-15 7:24 ` Nick Piggin @ 2009-01-15 14:12 ` James Bottomley 2009-01-15 17:44 ` Andrew Morton 3 siblings, 1 reply; 93+ messages in thread From: James Bottomley @ 2009-01-15 14:12 UTC (permalink / raw) To: Andrew Morton Cc: Matthew Wilcox, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty On Wed, 2009-01-14 at 18:04 -0800, Andrew Morton wrote: > On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <matthew@wil.cx> wrote: > > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote: > > > > > Linux OLTP Performance summary > > > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait% > > > > > 2.6.24.2 1.000 21969 43425 76 24 0 0 > > > > > 2.6.27.2 0.973 30402 43523 74 25 0 1 > > > > > 2.6.29-rc1 0.965 30331 41970 74 26 0 0 > > > > > But the interrupt rate went through the roof. > > > > Yes. I forget why that was; I'll have to dig through my archives for > > that. > > Oh. I'd have thought that this alone could account for 3.5%. Me too. Anecdotally, I haven't noticed this in my lab machines, but what I have noticed is on someone else's laptop (a hyperthreaded atom) that I was trying to demo powertop on was that IPI reschedule interrupts seem to be out of control ... they were ticking over at a really high rate and preventing the CPU from spending much time in the low C and P states. To me this implicates some scheduler problem since that's the primary producer of IPI reschedules ... I think it wouldn't be a significant extrapolation to predict that the scheduler might be the cause of the above problem as well. James ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-15 14:12 ` James Bottomley @ 2009-01-15 17:44 ` Andrew Morton 2009-01-15 18:00 ` Matthew Wilcox 0 siblings, 1 reply; 93+ messages in thread From: Andrew Morton @ 2009-01-15 17:44 UTC (permalink / raw) To: James Bottomley Cc: Matthew Wilcox, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty On Thu, 15 Jan 2009 09:12:46 -0500 James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > On Wed, 2009-01-14 at 18:04 -0800, Andrew Morton wrote: > > On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <matthew@wil.cx> wrote: > > > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote: > > > > > > Linux OLTP Performance summary > > > > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait% > > > > > > 2.6.24.2 1.000 21969 43425 76 24 0 0 > > > > > > 2.6.27.2 0.973 30402 43523 74 25 0 1 > > > > > > 2.6.29-rc1 0.965 30331 41970 74 26 0 0 > > > > > > > But the interrupt rate went through the roof. > > > > > > Yes. I forget why that was; I'll have to dig through my archives for > > > that. > > > > Oh. I'd have thought that this alone could account for 3.5%. > > Me too. Anecdotally, I haven't noticed this in my lab machines, but > what I have noticed is on someone else's laptop (a hyperthreaded atom) > that I was trying to demo powertop on was that IPI reschedule interrupts > seem to be out of control ... they were ticking over at a really high > rate and preventing the CPU from spending much time in the low C and P > states. To me this implicates some scheduler problem since that's the > primary producer of IPI reschedules ... I think it wouldn't be a > significant extrapolation to predict that the scheduler might be the > cause of the above problem as well. > Good point. The context switch rate actually went down a bit. 
I wonder if the Intel test people have records of /proc/interrupts for the various kernel versions. ^ permalink raw reply [flat|nested] 93+ messages in thread
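Andrew's question, whether /proc/interrupts records exist for the various kernels, amounts to diffing per-source interrupt totals between two snapshots. A minimal sketch of that comparison follows; the snapshot text below is illustrative sample data in /proc/interrupts format, not the actual OLTP benchmark records.

```python
# Diff two /proc/interrupts snapshots to see which interrupt sources grew.
# The snapshots are illustrative samples, not the benchmark data.

def parse_interrupts(snapshot):
    """Map interrupt label (IRQ number, NMI, RES, ...) to its total count."""
    totals = {}
    for line in snapshot.strip().splitlines()[1:]:  # skip the CPU header row
        label, _, rest = line.partition(":")
        counts = []
        for field in rest.split():
            if field.isdigit():
                counts.append(int(field))
            else:
                break  # description text starts here
        totals[label.strip()] = sum(counts)
    return totals

def interrupt_delta(before, after):
    """Per-label count increase between two snapshots."""
    b, a = parse_interrupts(before), parse_interrupts(after)
    return {k: a[k] - b.get(k, 0) for k in a}

before = """\
           CPU0       CPU1
  0:       1000       1100   IO-APIC-edge      timer
NMI:         10         12   Non-maskable interrupts
RES:        500        480   Rescheduling interrupts
"""
after = """\
           CPU0       CPU1
  0:       2000       2200   IO-APIC-edge      timer
NMI:         40         44   Non-maskable interrupts
RES:       2500       2400   Rescheduling interrupts
"""

delta = interrupt_delta(before, after)
```

Run against real records from each kernel, a disproportionate growth in the RES row would point at the reschedule-IPI theory directly.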
* Re: Mainline kernel OLTP performance update 2009-01-15 17:44 ` Andrew Morton @ 2009-01-15 18:00 ` Matthew Wilcox 2009-01-15 18:14 ` Steven Rostedt 0 siblings, 1 reply; 93+ messages in thread From: Matthew Wilcox @ 2009-01-15 18:00 UTC (permalink / raw) To: Andrew Morton Cc: James Bottomley, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty On Thu, Jan 15, 2009 at 09:44:42AM -0800, Andrew Morton wrote: > > Me too. Anecdotally, I haven't noticed this in my lab machines, but > > what I have noticed is on someone else's laptop (a hyperthreaded atom) > > that I was trying to demo powertop on was that IPI reschedule interrupts > > seem to be out of control ... they were ticking over at a really high > > rate and preventing the CPU from spending much time in the low C and P > > states. To me this implicates some scheduler problem since that's the > > primary producer of IPI reschedules ... I think it wouldn't be a > > significant extrapolation to predict that the scheduler might be the > > cause of the above problem as well. > > > > Good point. > > The context switch rate actually went down a bit. > > I wonder if the Intel test people have records of /proc/interrupts for > the various kernel versions. I think Chinang does, but he's out of office today. He did say in an earlier reply: > I took a quick look at the interrupts figure between 2.6.24 and 2.6.27. > i/o interuputs is slightly down in 2.6.27 (due to reduce throughput). > But both NMI and reschedule interrupt increased. Reschedule interrupts > is 2x of 2.6.24. So if the reschedule interrupt is happening twice as often, and the context switch rate is basically unchanged, I guess that means the scheduler is doing a lot more work to get approximately the same results. And that seems like a bad thing. 
Again, it's worth bearing in mind that these are all RT tasks, so the underlying problem may be very different from the one that both James and I have observed with an Atom laptop running predominantly non-RT tasks. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 93+ messages in thread
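Matthew's inference can be made concrete with the Intr/s and CtxSw/s columns from the summary table at the top of the thread: interrupts per context switch rose by roughly 40% even though the context-switch rate barely moved. A small check of that arithmetic:

```python
# Interrupts per context switch, from the thread's summary table.
# A rising ratio with a flat context-switch rate suggests the kernel is
# taking more IPIs per useful reschedule.

rates = {
    "2.6.24.2":   {"intr": 21969, "ctxsw": 43425},
    "2.6.27.2":   {"intr": 30402, "ctxsw": 43523},
    "2.6.29-rc1": {"intr": 30331, "ctxsw": 41970},
}

ratio = {k: v["intr"] / v["ctxsw"] for k, v in rates.items()}
```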
* Re: Mainline kernel OLTP performance update 2009-01-15 18:00 ` Matthew Wilcox @ 2009-01-15 18:14 ` Steven Rostedt 2009-01-15 18:44 ` Gregory Haskins 2009-01-15 19:28 ` Ma, Chinang 0 siblings, 2 replies; 93+ messages in thread From: Steven Rostedt @ 2009-01-15 18:14 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrew Morton, James Bottomley, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Gregory Haskins On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote: > On Thu, Jan 15, 2009 at 09:44:42AM -0800, Andrew Morton wrote: > > > Me too. Anecdotally, I haven't noticed this in my lab machines, but > > > what I have noticed is on someone else's laptop (a hyperthreaded atom) > > > that I was trying to demo powertop on was that IPI reschedule interrupts > > > seem to be out of control ... they were ticking over at a really high > > > rate and preventing the CPU from spending much time in the low C and P > > > states. To me this implicates some scheduler problem since that's the > > > primary producer of IPI reschedules ... I think it wouldn't be a > > > significant extrapolation to predict that the scheduler might be the > > > cause of the above problem as well. > > > > > > > Good point. > > > > The context switch rate actually went down a bit. > > > > I wonder if the Intel test people have records of /proc/interrupts for > > the various kernel versions. > > I think Chinang does, but he's out of office today. He did say in an > earlier reply: > > > I took a quick look at the interrupts figure between 2.6.24 and 2.6.27. > > i/o interuputs is slightly down in 2.6.27 (due to reduce throughput). > > But both NMI and reschedule interrupt increased. Reschedule interrupts > > is 2x of 2.6.24. 
> > So if the reschedule interrupt is happening twice as often, and the > context switch rate is basically unchanged, I guess that means the > scheduler is doing a lot more work to get approximately the same > results. And that seems like a bad thing. > > Again, it's worth bearing in mind that these are all RT tasks, so the > underlying problem may be very different from the one that both James and > I have observed with an Atom laptop running predominantly non-RT tasks. > The RT scheduler is a bit more aggressive than it used to be. It used to just migrate RT tasks when the migration thread woke up, and did that in "bulk". Now, when an individual RT task wakes up and it cannot run on the current CPU but can on another CPU, it is scheduled immediately, and an IPI is sent out. As for context switching, it would be the same amount as before, but the difference is that the RT task will try to wake up as soon as possible. This also causes RT tasks to bounce around CPUs more often. If there are many threads, they should not be RT, unless there is some design behind it. Forgive me if you already did this and said so, but what is the result of just making the writer an RT task and keeping all the readers as SCHED_OTHER? -- Steve ^ permalink raw reply [flat|nested] 93+ messages in thread
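Steven's suggestion, raising only one process (the log writer) to an RT policy while the rest stay SCHED_OTHER, can be sketched with the Linux scheduling syscalls. This is a hedged illustration, not the benchmark's actual setup: SCHED_FIFO needs CAP_SYS_NICE, so the sketch falls back gracefully when run unprivileged, and the priority value 50 is an arbitrary example.

```python
# Sketch: give one task SCHED_FIFO, leave everything else SCHED_OTHER.
# Requires Linux; setting an RT policy needs CAP_SYS_NICE, so an
# unprivileged run leaves the policy unchanged.
import os

def try_make_fifo(pid, prio=50):
    """Try to give `pid` SCHED_FIFO at `prio`; return the resulting policy."""
    try:
        os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(prio))
    except PermissionError:
        pass  # unprivileged: the task keeps its current policy
    return os.sched_getscheduler(pid)

policy = try_make_fifo(0)  # pid 0 means the calling process itself
is_rt = policy in (os.SCHED_FIFO, os.SCHED_RR)
```

In the benchmark's terms, one would apply this to the log writer's pid only and leave the reader processes at the default policy.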
* Re: Mainline kernel OLTP performance update 2009-01-15 18:14 ` Steven Rostedt @ 2009-01-15 18:44 ` Gregory Haskins 2009-01-15 18:46 ` Wilcox, Matthew R 2009-01-15 19:28 ` Ma, Chinang 1 sibling, 1 reply; 93+ messages in thread From: Gregory Haskins @ 2009-01-15 18:44 UTC (permalink / raw) To: Steven Rostedt Cc: Matthew Wilcox, Andrew Morton, James Bottomley, Wilcox, Matthew R, chinang.ma, linux-kernel, sharad.c.tripathi, arjan, andi.kleen, suresh.b.siddha, harita.chilukuri, douglas.w.styner, peter.xihong.wang, hubert.nueckel, chris.mason, linux-scsi, Andrew Vasquez, Anirban Chakraborty [-- Attachment #1: Type: text/plain, Size: 2605 bytes --] Steven Rostedt wrote: > On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote: > >> On Thu, Jan 15, 2009 at 09:44:42AM -0800, Andrew Morton wrote: >> >>>> Me too. Anecdotally, I haven't noticed this in my lab machines, but >>>> what I have noticed is on someone else's laptop (a hyperthreaded atom) >>>> that I was trying to demo powertop on was that IPI reschedule interrupts >>>> seem to be out of control ... they were ticking over at a really high >>>> rate and preventing the CPU from spending much time in the low C and P >>>> states. To me this implicates some scheduler problem since that's the >>>> primary producer of IPI reschedules ... I think it wouldn't be a >>>> significant extrapolation to predict that the scheduler might be the >>>> cause of the above problem as well. >>>> >>>> >>> Good point. >>> >>> The context switch rate actually went down a bit. >>> >>> I wonder if the Intel test people have records of /proc/interrupts for >>> the various kernel versions. >>> >> I think Chinang does, but he's out of office today. He did say in an >> earlier reply: >> >> >>> I took a quick look at the interrupts figure between 2.6.24 and 2.6.27. >>> i/o interuputs is slightly down in 2.6.27 (due to reduce throughput). >>> But both NMI and reschedule interrupt increased. Reschedule interrupts >>> is 2x of 2.6.24. 
>>> >> So if the reschedule interrupt is happening twice as often, and the >> context switch rate is basically unchanged, I guess that means the >> scheduler is doing a lot more work to get approximately the same >> results. And that seems like a bad thing. >> I would be very interested in gathering some data in this area. One thing that pops to mind is to instrument the resched-ipi with ftrace_printk() and gather a trace of this system in action. I assume that I wouldn't have access to this OLTP suite, so I may need a volunteer to try this for me. I could put together an instrumentation patch for the testers' convenience if they prefer. Another data-point I wouldn't mind seeing is looking at the scheduler statistics, particularly with my sched-top utility, which you can find here: http://rt.wiki.kernel.org/index.php/Schedtop_utility (Note you may want to exclude the sched_info stats, as they are inherently noisy and make it hard to see the real trends. To do this run it with: 'schedtop -x "sched_info"'.) In the meantime, I will try similar approaches here on other non-OLTP based workloads to see if I spy anything that looks amiss. -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 93+ messages in thread
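The approach behind schedtop is to sample scheduler counters twice and report the fields that moved the most, with the noisy sched_info group filtered out. A miniature sketch of that idea, using made-up field names and numbers for illustration (schedtop itself pulls real counters from the kernel's scheduler statistics files):

```python
# The schedtop idea in miniature: sample counters twice and rank the
# largest increases, excluding noisy prefixes. Field names and values
# here are fabricated for illustration.

def top_deltas(before, after, exclude=(), n=3):
    """Largest counter increases between two samples, noisiest excluded."""
    deltas = {
        k: after[k] - before[k]
        for k in after
        if not any(k.startswith(p) for p in exclude)
    }
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:n]

sample_t0 = {"sched_info_run_delay": 900, "rto_push_count": 10,
             "rto_pull_count": 4, "ttwu_count": 1000}
sample_t1 = {"sched_info_run_delay": 5000, "rto_push_count": 510,
             "rto_pull_count": 54, "ttwu_count": 1300}

top = top_deltas(sample_t0, sample_t1, exclude=("sched_info",))
```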
* RE: Mainline kernel OLTP performance update 2009-01-15 18:44 ` Gregory Haskins @ 2009-01-15 18:46 ` Wilcox, Matthew R 2009-01-15 19:44 ` Ma, Chinang 0 siblings, 1 reply; 93+ messages in thread From: Wilcox, Matthew R @ 2009-01-15 18:46 UTC (permalink / raw) To: Gregory Haskins, Steven Rostedt Cc: Matthew Wilcox, Andrew Morton, James Bottomley, Ma, Chinang, linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, chris.mason, linux-scsi, Andrew Vasquez, Anirban Chakraborty [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1342 bytes --] Gregory Haskins [mailto:ghaskins@novell.com] wrote: > > On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote: > >> So if the reschedule interrupt is happening twice as often, and the > >> context switch rate is basically unchanged, I guess that means the > >> scheduler is doing a lot more work to get approximately the same > >> results. And that seems like a bad thing. > > I would be very interested in gathering some data in this area. One > thing that pops to mind is to instrument the resched-ipi with > ftrace_printk() and gather a trace of this system in action. I assume > that I wouldn't have access to this OLTP suite, so I may need a > volunteer to try this for me. I could put together an instrumentation > patch for the testers convenience if they prefer. I don't know whether Novell have an arrangement with the Well-Known Commercial Database and the Well-Known OLTP Benchmark to do runs like this. Chinang is normally only too happy to build his own kernels with patches from people who are interested in helping, so that's probably the best way to do it. 
I'm leaving for LCA in an hour or so, so further responses from me to this thread are unlikely ;-) ^ permalink raw reply [flat|nested] 93+ messages in thread
* RE: Mainline kernel OLTP performance update 2009-01-15 18:46 ` Wilcox, Matthew R @ 2009-01-15 19:44 ` Ma, Chinang 2009-01-16 18:14 ` Gregory Haskins 0 siblings, 1 reply; 93+ messages in thread From: Ma, Chinang @ 2009-01-15 19:44 UTC (permalink / raw) To: Wilcox, Matthew R, Gregory Haskins, Steven Rostedt Cc: Matthew Wilcox, Andrew Morton, James Bottomley, linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, chris.mason, linux-scsi, Andrew Vasquez, Anirban Chakraborty Gregory. I will test the resched-ipi instrumentation patch with our OLTP if you can post the patch and some instructions. Thanks, -Chinang >-----Original Message----- >From: Wilcox, Matthew R >Sent: Thursday, January 15, 2009 10:47 AM >To: Gregory Haskins; Steven Rostedt >Cc: Matthew Wilcox; Andrew Morton; James Bottomley; Ma, Chinang; linux- >kernel@vger.kernel.org; Tripathi, Sharad C; arjan@linux.intel.com; Kleen, >Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter >Xihong; Nueckel, Hubert; chris.mason@oracle.com; linux-scsi@vger.kernel.org; >Andrew Vasquez; Anirban Chakraborty >Subject: RE: Mainline kernel OLTP performance update > >Gregory Haskins [mailto:ghaskins@novell.com] wrote: >> > On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote: >> >> So if the reschedule interrupt is happening twice as often, and the >> >> context switch rate is basically unchanged, I guess that means the >> >> scheduler is doing a lot more work to get approximately the same >> >> results. And that seems like a bad thing. >> >> I would be very interested in gathering some data in this area. One >> thing that pops to mind is to instrument the resched-ipi with >> ftrace_printk() and gather a trace of this system in action. I assume >> that I wouldn't have access to this OLTP suite, so I may need a >> volunteer to try this for me. 
I could put together an instrumentation >> patch for the testers convenience if they prefer. > >I don't know whether Novell have an arrangement with the Well-Known >Commercial Database and the Well-Known OLTP Benchmark to do runs like this. >Chinang is normally only too happy to build his own kernels with patches >from people who are interested in helping, so that's probably the best way >to do it. > >I'm leaving for LCA in an hour or so, so further responses from me to this >thread are unlikely ;-) ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-15 19:44 ` Ma, Chinang @ 2009-01-16 18:14 ` Gregory Haskins 2009-01-16 19:09 ` Steven Rostedt 2009-01-20 12:45 ` Gregory Haskins 0 siblings, 2 replies; 93+ messages in thread From: Gregory Haskins @ 2009-01-16 18:14 UTC (permalink / raw) To: Ma, Chinang Cc: Wilcox, Matthew R, Steven Rostedt, Matthew Wilcox, Andrew Morton, James Bottomley, linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, chris.mason, linux-scsi, Andrew Vasquez, Anirban Chakraborty [-- Attachment #1.1: Type: text/plain, Size: 1964 bytes --] Ma, Chinang wrote: > Gregory. > I will test the resched-ipi instrumentation patch with our OLTP if you can post the patch and some instructions. > Thanks, > -Chinang > Hi Chinang, Please find a patch attached which applies to linus.git as of today. You will also want to enable CONFIG_FUNCTION_TRACER as well as the trace components. Here is my system:

ghaskins@dev:~/sandbox/git/linux-2.6-rt> grep TRACE .config
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_TRACEPOINTS=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
CONFIG_X86_PTRACE_BTS=y
# CONFIG_ACPI_DEBUG_FUNC_TRACE is not set
CONFIG_NETFILTER_XT_TARGET_TRACE=m
CONFIG_SOUND_TRACEINIT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_STACKTRACE=y
# CONFIG_BACKTRACE_SELF_TEST is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_HW_BRANCH_TRACER=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_SYSPROF_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_CONTEXT_SWITCH_TRACER=y
# CONFIG_BOOT_TRACER is not set
# CONFIG_TRACE_BRANCH_PROFILING is not set
CONFIG_POWER_TRACER=y
CONFIG_STACK_TRACER=y
CONFIG_HW_BRANCH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_KVM_TRACE is not set

Then on your booted system, do:

echo sched_switch > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_enabled
$run_oltp && echo 0 > /sys/kernel/debug/tracing/tracing_enabled

(where $run_oltp is your suite)

Then, email the contents of /sys/kernel/debug/tracing/trace to me

-Greg

[-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1.2: instrumentation.patch --] [-- Type: text/x-patch; name="instrumentation.patch", Size: 3263 bytes --]

ftrace instrumentation for RT tasks

From: Gregory Haskins <ghaskins@novell.com>

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
 arch/x86/kernel/smp.c |    2 ++
 include/linux/sched.h |    6 ++++++
 kernel/sched.c        |    3 +++
 kernel/sched_rt.c     |   10 ++++++++++
 4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index e6faa33..468abeb 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -118,6 +118,7 @@ static void native_smp_send_reschedule(int cpu)
         WARN_ON(1);
         return;
     }
+    ftrace_printk("cpu %d\n", cpu);
     send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
 }
@@ -171,6 +172,7 @@ static void native_smp_send_stop(void)
  */
 void smp_reschedule_interrupt(struct pt_regs *regs)
 {
+    ftrace_printk("NEEDS_RESCHED\n");
     ack_APIC_irq();
     inc_irq_stat(irq_resched_count);
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4cae9b8..a320692 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2094,8 +2094,14 @@ static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
     return test_ti_thread_flag(task_thread_info(tsk), flag);
 }
+# define ftrace_printk(fmt...) __ftrace_printk(_THIS_IP_, fmt)
+extern int
+__ftrace_printk(unsigned long ip, const char *fmt, ...)
+    __attribute__ ((format (printf, 2, 3)));
+
 static inline void set_tsk_need_resched(struct task_struct *tsk)
 {
+    ftrace_printk("%s/%d\n", tsk->comm, tsk->pid);
     set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
 }
diff --git a/kernel/sched.c b/kernel/sched.c
index 52bbf1c..d55fcf1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1874,6 +1874,9 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
               *new_cfsrq = cpu_cfs_rq(old_cfsrq, new_cpu);
     u64 clock_offset;
+    ftrace_printk("migrate %s/%d [%d] -> [%d]\n",
+                  p->comm, p->pid, task_cpu(p), new_cpu);
+
     clock_offset = old_rq->clock - new_rq->clock;
     trace_sched_migrate_task(p, task_cpu(p), new_cpu);
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 954e1a8..59cf64b 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1102,6 +1102,8 @@ static int push_rt_task(struct rq *rq)
     if (!next_task)
         return 0;
+    ftrace_printk("attempting push\n");
+
 retry:
     if (unlikely(next_task == rq->curr)) {
         WARN_ON(1);
@@ -1139,6 +1141,8 @@ static int push_rt_task(struct rq *rq)
         goto out;
     }
+    ftrace_printk("%s/%d\n", next_task->comm, next_task->pid);
+
     deactivate_task(rq, next_task, 0);
     set_task_cpu(next_task, lowest_rq->cpu);
     activate_task(lowest_rq, next_task, 0);
@@ -1180,6 +1184,8 @@ static int pull_rt_task(struct rq *this_rq)
     if (likely(!rt_overloaded(this_rq)))
         return 0;
+    ftrace_printk("attempting pull\n");
+
     next = pick_next_task_rt(this_rq);
     for_each_cpu(cpu, this_rq->rd->rto_mask) {
@@ -1234,6 +1240,10 @@ static int pull_rt_task(struct rq *this_rq)
     ret = 1;
+    ftrace_printk("pull %s/%d [%d] -> [%d]\n",
+                  p->comm, p->pid,
+                  src_rq->cpu, this_rq->cpu);
+
     deactivate_task(src_rq, p, 0);
     set_task_cpu(p, this_cpu);
     activate_task(this_rq, p, 0);

[-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply related [flat|nested] 93+ messages in thread
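One quick way to summarize the trace this patch produces is to count the "migrate comm/pid [src] -> [dst]" events per task, which shows directly how often the RT push/pull logic is bouncing each process between CPUs. The sketch below parses that event format; the sample trace lines are fabricated and the usual ftrace per-line prefix is omitted for brevity.

```python
# Count per-task migration events emitted by the patch's
# ftrace_printk("migrate %s/%d [%d] -> [%d]\n", ...) call.
# The sample trace lines are fabricated for illustration.
import re
from collections import Counter

MIGRATE = re.compile(r"migrate (\S+)/(\d+) \[(\d+)\] -> \[(\d+)\]")

def count_migrations(trace_text):
    """Number of migrate events per comm/pid seen in the trace."""
    counts = Counter()
    for m in MIGRATE.finditer(trace_text):
        comm, pid, src, dst = m.groups()
        counts["%s/%s" % (comm, pid)] += 1
    return counts

trace = """\
migrate oracle/4316 [0] -> [3]
migrate oracle/4316 [3] -> [1]
migrate lgwr/4290 [2] -> [0]
"""
migrations = count_migrations(trace)
```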
* Re: Mainline kernel OLTP performance update 2009-01-16 18:14 ` Gregory Haskins @ 2009-01-16 19:09 ` Steven Rostedt 2009-01-20 12:45 ` Gregory Haskins 1 sibling, 0 replies; 93+ messages in thread From: Steven Rostedt @ 2009-01-16 19:09 UTC (permalink / raw) To: Gregory Haskins Cc: Ma, Chinang, Wilcox, Matthew R, Matthew Wilcox, Andrew Morton, James Bottomley, linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, chris.mason, linux-scsi, Andrew Vasquez, Anirban Chakraborty On Fri, 2009-01-16 at 13:14 -0500, Gregory Haskins wrote: > Ma, Chinang wrote: > > Gregory. > > I will test the resched-ipi instrumentation patch with our OLTP if you can post the patch and some instructions. > > Thanks, > > -Chinang > > > > Hi Chinang, > Please find a patch attached which applies to linus.git as of today. > You will also want to enable CONFIG_FUNCTION_TRACER as well as the trace > components. Here is my system: > I don't see why CONFIG_FUNCTION_TRACER is needed. 
> ghaskins@dev:~/sandbox/git/linux-2.6-rt> grep TRACE .config
> CONFIG_STACKTRACE_SUPPORT=y
> CONFIG_TRACEPOINTS=y
> CONFIG_HAVE_ARCH_TRACEHOOK=y
> CONFIG_BLK_DEV_IO_TRACE=y
> # CONFIG_TREE_RCU_TRACE is not set
> # CONFIG_PREEMPT_RCU_TRACE is not set
> CONFIG_X86_PTRACE_BTS=y
> # CONFIG_ACPI_DEBUG_FUNC_TRACE is not set
> CONFIG_NETFILTER_XT_TARGET_TRACE=m
> CONFIG_SOUND_TRACEINIT=y
> CONFIG_TRACE_IRQFLAGS_SUPPORT=y
> CONFIG_TRACE_IRQFLAGS=y
> CONFIG_STACKTRACE=y
> # CONFIG_BACKTRACE_SELF_TEST is not set
> CONFIG_USER_STACKTRACE_SUPPORT=y
> CONFIG_NOP_TRACER=y
> CONFIG_HAVE_FUNCTION_TRACER=y
> CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
> CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
> CONFIG_HAVE_DYNAMIC_FTRACE=y
> CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
> CONFIG_HAVE_HW_BRANCH_TRACER=y
> CONFIG_TRACER_MAX_TRACE=y
> CONFIG_FUNCTION_TRACER=y
> CONFIG_FUNCTION_GRAPH_TRACER=y
> CONFIG_IRQSOFF_TRACER=y
> CONFIG_SYSPROF_TRACER=y
> CONFIG_SCHED_TRACER=y

This CONFIG_SCHED_TRACER should be enough.

-- Steve

> CONFIG_CONTEXT_SWITCH_TRACER=y
> # CONFIG_BOOT_TRACER is not set
> # CONFIG_TRACE_BRANCH_PROFILING is not set
> CONFIG_POWER_TRACER=y
> CONFIG_STACK_TRACER=y
> CONFIG_HW_BRANCH_TRACER=y
> CONFIG_DYNAMIC_FTRACE=y
> CONFIG_FTRACE_MCOUNT_RECORD=y
> # CONFIG_FTRACE_STARTUP_TEST is not set
> # CONFIG_MMIOTRACE is not set
> # CONFIG_KVM_TRACE is not set
>
> Then on your booted system, do:
>
> echo sched_switch > /sys/kernel/debug/tracing/current_tracer
> echo 1 > /sys/kernel/debug/tracing/tracing_enabled
> $run_oltp && echo 0 > /sys/kernel/debug/tracing/tracing_enabled
>
> (where $run_oltp is your suite)
>
> Then, email the contents of /sys/kernel/debug/tracing/trace to me
>
> -Greg

^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Mainline kernel OLTP performance update 2009-01-16 18:14 ` Gregory Haskins 2009-01-16 19:09 ` Steven Rostedt @ 2009-01-20 12:45 ` Gregory Haskins 1 sibling, 0 replies; 93+ messages in thread From: Gregory Haskins @ 2009-01-20 12:45 UTC (permalink / raw) To: Ma, Chinang Cc: Wilcox, Matthew R, Steven Rostedt, Matthew Wilcox, Andrew Morton, James Bottomley, linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, chris.mason, linux-scsi, Andrew Vasquez, Anirban Chakraborty [-- Attachment #1: Type: text/plain, Size: 1535 bytes --] Gregory Haskins wrote: > > Then, email the contents of /sys/kernel/debug/tracing/trace to me > > > [ Chinang has performed the trace as requested, but replied with a reduced CC to avoid spamming people with a large file. This is restoring the original list] Ma, Chinang wrote: > Hi Gregory, > Trace in attachment. I trim down the distribution list. As the attachment is quite big. > > Thanks, > -Chinang > Hi Chinang, Thank you very much for taking the time to do this. I have analyzed the trace: I do not see any smoking gun w.r.t. the theory that we are over IPI'ing the system. There were holes in the data due to trace limitations that rendered some of the data inconclusive. However, the places where we did not run into trace limitations looked like everything was functioning as designed. That being said, I do see that you have a ton of prio 48(ish) threads that are over-straining the RT push logic. The interesting thing here is I recently pushed some patches to tip that have potential to help you here. Could you try your test using the sched/rt branch from -tip? Here is a clone link, for your convenience: git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-tip.git sched/rt For this run, do _not_ use the trace patch/config. 
I just want to see if you observe performance improvements with OLTP configured for RT prio when compared to historic rt-push/pull based kernels (including HEAD on linus.git, as tested in the last run) Thanks! -Greg [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 93+ messages in thread
* RE: Mainline kernel OLTP performance update 2009-01-15 18:14 ` Steven Rostedt 2009-01-15 18:44 ` Gregory Haskins @ 2009-01-15 19:28 ` Ma, Chinang 1 sibling, 0 replies; 93+ messages in thread From: Ma, Chinang @ 2009-01-15 19:28 UTC (permalink / raw) To: Steven Rostedt, Matthew Wilcox Cc: Andrew Morton, James Bottomley, Wilcox, Matthew R, linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, chris.mason, linux-scsi, Andrew Vasquez, Anirban Chakraborty, Gregory Haskins >-----Original Message----- >From: Steven Rostedt [mailto:srostedt@redhat.com] >Sent: Thursday, January 15, 2009 10:15 AM >To: Matthew Wilcox >Cc: Andrew Morton; James Bottomley; Wilcox, Matthew R; Ma, Chinang; linux- >kernel@vger.kernel.org; Tripathi, Sharad C; arjan@linux.intel.com; Kleen, >Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter >Xihong; Nueckel, Hubert; chris.mason@oracle.com; linux-scsi@vger.kernel.org; >Andrew Vasquez; Anirban Chakraborty; Gregory Haskins >Subject: Re: Mainline kernel OLTP performance update > > >On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote: >> On Thu, Jan 15, 2009 at 09:44:42AM -0800, Andrew Morton wrote: >> > > Me too. Anecdotally, I haven't noticed this in my lab machines, but >> > > what I have noticed is on someone else's laptop (a hyperthreaded atom) >> > > that I was trying to demo powertop on was that IPI reschedule >interrupts >> > > seem to be out of control ... they were ticking over at a really high >> > > rate and preventing the CPU from spending much time in the low C and >P >> > > states. To me this implicates some scheduler problem since that's >the >> > > primary producer of IPI reschedules ... I think it wouldn't be a >> > > significant extrapolation to predict that the scheduler might be the >> > > cause of the above problem as well. >> > > >> > >> > Good point. >> > >> > The context switch rate actually went down a bit. 
>> > >> > I wonder if the Intel test people have records of /proc/interrupts for >> > the various kernel versions. >> >> I think Chinang does, but he's out of office today. He did say in an >> earlier reply: >> >> > I took a quick look at the interrupts figure between 2.6.24 and 2.6.27. >> > I/O interrupts are slightly down in 2.6.27 (due to reduced throughput). >> > But both NMI and reschedule interrupts increased. Reschedule interrupts >> > are 2x of 2.6.24. >> >> So if the reschedule interrupt is happening twice as often, and the >> context switch rate is basically unchanged, I guess that means the >> scheduler is doing a lot more work to get approximately the same >> results. And that seems like a bad thing. >> >> Again, it's worth bearing in mind that these are all RT tasks, so the >> underlying problem may be very different from the one that both James and >> I have observed with an Atom laptop running predominantly non-RT tasks. >> > >The RT scheduler is a bit more aggressive than it used to be. It used to >just migrate RT tasks when the migration thread woke up, and did that in >"bulk". Now, when an individual RT task wakes up and it cannot run on >the current CPU but can on another CPU, it is scheduled immediately, and >an IPI is sent out. > >As for context switching, it would be the same amount as before, but the >difference is that the RT task will try to wake up as soon as possible. >This also causes RT tasks to bounce around CPUs more often. > >If there are many threads, they should not be RT, unless there is some >design behind it. > >Forgive me if you already did this and said so, but what is the result >of just making the writer an RT task and keeping all the readers as >SCHED_OTHER? > >-- Steve > I think the high OLTP throughput with rt-prio is due to the fixed time-slice.
It is better to give a DBMS process a bigger timeslice for getting a data buffer lock, processing data, releasing the lock and switching out while waiting on I/O, instead of being forced to switch out while still holding a data lock. I suppose SCHED_OTHER is the default policy for user processes. We tried setting only the log writer to RT and leaving all other DBMS processes in the default sched policy, and the performance was ~1.5% lower than the all rt-prio process result. ^ permalink raw reply [flat|nested] 93+ messages in thread
* RE: Mainline kernel OLTP performance update 2009-01-15 1:21 ` Matthew Wilcox 2009-01-15 2:04 ` Andrew Morton @ 2009-01-15 16:48 ` Ma, Chinang 1 sibling, 0 replies; 93+ messages in thread From: Ma, Chinang @ 2009-01-15 16:48 UTC (permalink / raw) To: Matthew Wilcox, Andrew Morton Cc: Wilcox, Matthew R, linux-kernel, Tripathi, Sharad C, arjan, Kleen, Andi, Siddha, Suresh B, Chilukuri, Harita, Styner, Douglas W, Wang, Peter Xihong, Nueckel, Hubert, chris.mason, srostedt, linux-scsi, Andrew Vasquez, Anirban Chakraborty >-----Original Message----- >From: Matthew Wilcox [mailto:matthew@wil.cx] >Sent: Wednesday, January 14, 2009 5:22 PM >To: Andrew Morton >Cc: Wilcox, Matthew R; Ma, Chinang; linux-kernel@vger.kernel.org; Tripathi, >Sharad C; arjan@linux.intel.com; Kleen, Andi; Siddha, Suresh B; Chilukuri, >Harita; Styner, Douglas W; Wang, Peter Xihong; Nueckel, Hubert; >chris.mason@oracle.com; srostedt@redhat.com; linux-scsi@vger.kernel.org; >Andrew Vasquez; Anirban Chakraborty >Subject: Re: Mainline kernel OLTP performance update > >On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote: >> On Tue, 13 Jan 2009 15:44:17 -0700 >> "Wilcox, Matthew R" <matthew.r.wilcox@intel.com> wrote: >> > >> >> (top-posting repaired. That @intel.com address is a bad influence ;)) > >Alas, that email address goes to an Outlook client. Not much to be done >about that. > >> (cc linux-scsi) >> >> > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to >> > > 2.6.24.2 the regression is around 3.5%. >> > > >> > > Linux OLTP Performance summary >> > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% >iowait% >> > > 2.6.24.2 1.000 21969 43425 76 24 0 0 >> > > 2.6.27.2 0.973 30402 43523 74 25 0 1 >> > > 2.6.29-rc1 0.965 30331 41970 74 26 0 0 > >> But the interrupt rate went through the roof. > >Yes. I forget why that was; I'll have to dig through my archives for >that. I took a quick look at the interrupts figure between 2.6.24 and 2.6.27. 
I/O interrupts are slightly down in 2.6.27 (due to reduced throughput). But both NMI and reschedule interrupts increased. Reschedule interrupts are 2x of 2.6.24. > >> A 3.5% slowdown in this workload is considered pretty serious, isn't it? > >Yes. Anything above 0.3% is statistically significant. 1% is a big >deal. The fact that we've lost 3.5% in the last year doesn't make >people happy. There's a few things we've identified that have a big >effect: > > - Per-partition statistics. Putting in a sysctl to stop doing them gets > some of that back, but not as much as taking them out (even when > the sysctl'd variable is in a __read_mostly section). We tried a > patch from Jens to speed up the search for a new partition, but it > had no effect. > > - The RT scheduler changes. They're better for some RT tasks, but not > the database benchmark workload. Chinang has posted about > this before, but the thread didn't really go anywhere. > http://marc.info/?t=122903815000001&r=1&w=2 > >SLUB would have had a huge negative effect if we were using it -- on the >order of 7% iirc. SLQB is at least performance-neutral with SLAB. > >-- >Matthew Wilcox Intel Open Source Technology Centre >"Bill, look, we understand that you're interested in selling us this >operating system, but compare it to ours. We can't possibly take such >a retrograde step." -Chinang ^ permalink raw reply [flat|nested] 93+ messages in thread
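The regression figures quoted throughout the thread follow directly from the Speedup(x) column of the summary table at the top, with 2.6.24.2 normalized to 1.000. A quick check of that arithmetic:

```python
# Regression percentages from the thread's Speedup(x) column
# (2.6.24.2 is the 1.000 baseline).

speedup = {"2.6.24.2": 1.000, "2.6.27.2": 0.973, "2.6.29-rc1": 0.965}

def regression_pct(kernel, baseline="2.6.24.2"):
    """Throughput loss vs. the baseline kernel, in percent."""
    return (speedup[baseline] - speedup[kernel]) / speedup[baseline] * 100.0

total = regression_pct("2.6.29-rc1")  # ~3.5%, the figure quoted in the thread
```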
end of thread, other threads:[~2009-02-12 16:08 UTC | newest]

Thread overview: 93+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-01-13 21:10 Mainline kernel OLTP performance update Ma, Chinang
2009-01-13 22:44 ` Wilcox, Matthew R
2009-01-15  0:35 ` Andrew Morton
2009-01-15  1:21 ` Matthew Wilcox
2009-01-15  2:04 ` Andrew Morton
2009-01-15  2:27 ` Steven Rostedt
2009-01-15  7:11 ` Ma, Chinang
2009-01-19 18:04 ` Chris Mason
2009-01-19 18:37 ` Steven Rostedt
2009-01-19 18:55 ` Chris Mason
2009-01-19 19:07 ` Steven Rostedt
2009-01-19 23:40 ` Ingo Molnar
2009-01-15  2:39 ` Andi Kleen
2009-01-15  2:47 ` Matthew Wilcox
2009-01-15  3:36 ` Andi Kleen
2009-01-20 13:27 ` Jens Axboe
[not found] ` <588992150B702C48B3312184F1B810AD03A497632C@azsmsx501.amr.corp.intel.com>
2009-01-22 11:29 ` Jens Axboe
[not found] ` <588992150B702C48B3312184F1B810AD03A4F59632@azsmsx501.amr.corp.intel.com>
2009-01-27  8:28 ` Jens Axboe
2009-01-15  7:24 ` Nick Piggin
2009-01-15  9:46 ` Pekka Enberg
2009-01-15 13:52 ` Matthew Wilcox
2009-01-15 14:42 ` Pekka Enberg
2009-01-16 10:16 ` Pekka Enberg
2009-01-16 10:21 ` Nick Piggin
2009-01-16 10:31 ` Pekka Enberg
2009-01-16 10:42 ` Nick Piggin
2009-01-16 10:55 ` Pekka Enberg
2009-01-19  7:13 ` Nick Piggin
2009-01-19  8:05 ` Pekka Enberg
2009-01-19  8:33 ` Nick Piggin
2009-01-19  8:42 ` Nick Piggin
2009-01-19  8:47 ` Pekka Enberg
2009-01-19  8:57 ` Nick Piggin
2009-01-19  9:48 ` Pekka Enberg
2009-01-19 10:03 ` Nick Piggin
2009-01-16 20:59 ` Christoph Lameter
2009-01-16  0:27 ` Andrew Morton
2009-01-16  4:03 ` Nick Piggin
2009-01-16  4:12 ` Andrew Morton
2009-01-16  6:46 ` Nick Piggin
2009-01-16  6:55 ` Matthew Wilcox
2009-01-16  7:06 ` Nick Piggin
2009-01-16  7:53 ` Zhang, Yanmin
2009-01-16 10:20 ` Andi Kleen
2009-01-20  5:16 ` Zhang, Yanmin
2009-01-21 23:58 ` Christoph Lameter
2009-01-22  8:36 ` Zhang, Yanmin
2009-01-22  9:15 ` Pekka Enberg
2009-01-22  9:28 ` Zhang, Yanmin
2009-01-22  9:47 ` Pekka Enberg
2009-01-23  3:02 ` Zhang, Yanmin
2009-01-23  6:52 ` Pekka Enberg
2009-01-23  8:06 ` Pekka Enberg
2009-01-23  8:30 ` Zhang, Yanmin
2009-01-23  8:40 ` Pekka Enberg
2009-01-23  9:46 ` Pekka Enberg
2009-01-23 15:22 ` Christoph Lameter
2009-01-23 15:31 ` Pekka Enberg
2009-01-23 15:55 ` Christoph Lameter
2009-01-23 16:01 ` Pekka Enberg
2009-01-24  2:55 ` Zhang, Yanmin
2009-01-24  7:36 ` Pekka Enberg
2009-02-12  5:22 ` Zhang, Yanmin
2009-02-12  5:47 ` Zhang, Yanmin
2009-02-12 15:25 ` Christoph Lameter
2009-02-12 16:07 ` Pekka Enberg
2009-02-12 16:03 ` Pekka Enberg
2009-01-26 17:36 ` Christoph Lameter
2009-02-01  2:52 ` Zhang, Yanmin
2009-01-23  8:33 ` Nick Piggin
2009-01-23  9:02 ` Zhang, Yanmin
2009-01-23 18:40 ` care and feeding of netperf (Re: Mainline kernel OLTP performance update) Rick Jones
2009-01-23 18:51 ` Grant Grundler
2009-01-24  3:03 ` Zhang, Yanmin
2009-01-26 18:26 ` Rick Jones
2009-01-16  7:00 ` Mainline kernel OLTP performance update Andrew Morton
2009-01-16  7:25 ` Nick Piggin
2009-01-16  8:59 ` Nick Piggin
2009-01-16 18:11 ` Rick Jones
2009-01-19  7:43 ` Nick Piggin
2009-01-19 22:19 ` Rick Jones
2009-01-15 14:12 ` James Bottomley
2009-01-15 17:44 ` Andrew Morton
2009-01-15 18:00 ` Matthew Wilcox
2009-01-15 18:14 ` Steven Rostedt
2009-01-15 18:44 ` Gregory Haskins
2009-01-15 18:46 ` Wilcox, Matthew R
2009-01-15 19:44 ` Ma, Chinang
2009-01-16 18:14 ` Gregory Haskins
2009-01-16 19:09 ` Steven Rostedt
2009-01-20 12:45 ` Gregory Haskins
2009-01-15 19:28 ` Ma, Chinang
2009-01-15 16:48 ` Ma, Chinang