From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S966593AbdAKN1U (ORCPT ); Wed, 11 Jan 2017 08:27:20 -0500
Received: from mx1.redhat.com ([209.132.183.28]:40384 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S966554AbdAKN1S (ORCPT ); Wed, 11 Jan 2017 08:27:18 -0500
Date: Wed, 11 Jan 2017 14:27:12 +0100
From: Jesper Dangaard Brouer
To: Mel Gorman
Cc: Linux Kernel ,
	Linux-MM ,
	Hillf Danton ,
	brouer@redhat.com
Subject: Re: [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests
Message-ID: <20170111142712.5fd8bea8@redhat.com>
In-Reply-To: <20170111134420.368efb9e@redhat.com>
References: <20170109163518.6001-1-mgorman@techsingularity.net>
	<20170109163518.6001-4-mgorman@techsingularity.net>
	<20170111134420.368efb9e@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
	(mx1.redhat.com [10.5.110.27]); Wed, 11 Jan 2017 13:27:19 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 11 Jan 2017 13:44:20 +0100
Jesper Dangaard Brouer wrote:

> On Mon, 9 Jan 2017 16:35:17 +0000 Mel Gorman wrote:
>
> > The following is results from a page allocator micro-benchmark. Only
> > order-0 is interesting as higher orders do not use the per-cpu allocator
>
> Micro-benchmarked with [1] page_bench02:
>  modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
>   rmmod page_bench02 ; dmesg --notime | tail -n 4
>
> Compared to baseline: 213 cycles(tsc) 53.417 ns
>  - against this     : 184 cycles(tsc) 46.056 ns
>  - Saving           : -29 cycles
>  - Very close to expected 27 cycles saving [see below [2]]

When perf benchmarking I noticed that the summed "children" perf overhead
of calling alloc_pages_current() is 65.05%, compared to a summed 28.28%
on the free path for calls under __free_pages().

This is caused by CONFIG_NUMA=y, as the call path is longer with NUMA
(and the extra helpers are non-inlined calls):

 alloc_pages -> alloc_pages_current -> __alloc_pages_nodemask ->
   get_page_from_freelist

Without NUMA the call levels get compacted by inlining to:

 __alloc_pages_nodemask -> get_page_from_freelist

After disabling NUMA, the split between the alloc (48.80%) and free
(42.67%) side is more balanced.

Saving from disabling CONFIG_NUMA:

 - CONFIG_NUMA=y : 184 cycles(tsc) 46.056 ns
 - CONFIG_NUMA=n : 143 cycles(tsc) 35.913 ns
 - Saving        :  41 cycles (approx 22%)

I would conclude that there is room for improvement in the CONFIG_NUMA
code path. Let's follow up on that in a later patch series...

> > Signed-off-by: Mel Gorman
> > Acked-by: Hillf Danton
>
> Acked-by: Jesper Dangaard Brouer
>
> [1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
> [2] Expected saving comes from Mel removing a local_irq_{save,restore}
>  and adding a preempt_{disable,enable} instead.
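
To make that concrete, here is a minimal sketch of the pattern change on
the per-cpu (pcp) list fast path. This is not Mel's actual patch; the
surrounding code is elided and only the locking/preemption primitives
are real:

    unsigned long flags;

    /* Before: pcp lists are IRQ-safe, costing a save/restore pair
     * (~38 cycles per alloc or free, see the table below):
     */
    local_irq_save(flags);
    /* ... manipulate this_cpu_ptr(zone->pageset) free lists ... */
    local_irq_restore(flags);

    /* After: pcp lists are only used from !in_interrupt() context,
     * so disabling preemption (~11 cycles) is enough:
     */
    preempt_disable();
    /* ... manipulate this_cpu_ptr(zone->pageset) free lists ... */
    preempt_enable();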
> Micro benchmarking via time_bench_sample[3], we get the cost of these
> operations:
>
> time_bench: Type:for_loop                        Per elem:   0 cycles(tsc)  0.232 ns (step:0)
> time_bench: Type:spin_lock_unlock                Per elem:  33 cycles(tsc)  8.334 ns (step:0)
> time_bench: Type:spin_lock_unlock_irqsave        Per elem:  62 cycles(tsc) 15.607 ns (step:0)
> time_bench: Type:irqsave_before_lock             Per elem:  57 cycles(tsc) 14.344 ns (step:0)
> time_bench: Type:spin_lock_unlock_irq            Per elem:  34 cycles(tsc)  8.560 ns (step:0)
> time_bench: Type:simple_irq_disable_before_lock  Per elem:  37 cycles(tsc)  9.289 ns (step:0)
> time_bench: Type:local_BH_disable_enable         Per elem:  19 cycles(tsc)  4.920 ns (step:0)
> time_bench: Type:local_IRQ_disable_enable        Per elem:   7 cycles(tsc)  1.864 ns (step:0)
> time_bench: Type:local_irq_save_restore          Per elem:  38 cycles(tsc)  9.665 ns (step:0)
>  [Mel's patch removes a ^^^^^^^^^^^^^^^^^^^^^]              ^^^ expected saving - preempt cost
> time_bench: Type:preempt_disable_enable          Per elem:  11 cycles(tsc)  2.794 ns (step:0)
>  [adds a preempt        ^^^^^^^^^^^^^^^^^^^^^]              ^^^ adds this cost
> time_bench: Type:funcion_call_cost               Per elem:   6 cycles(tsc)  1.689 ns (step:0)
> time_bench: Type:func_ptr_call_cost              Per elem:  11 cycles(tsc)  2.767 ns (step:0)
> time_bench: Type:page_alloc_put                  Per elem: 211 cycles(tsc) 52.803 ns (step:0)
>
> Thus, expected improvement is: 38-11 = 27 cycles.
>
> [3] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
>
> CPU used: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
>
> Config options of interest:
>  CONFIG_NUMA=y
>  CONFIG_DEBUG_LIST=n
>  CONFIG_VM_EVENT_COUNTERS=y
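
On the CONFIG_NUMA point above: the extra alloc-side call levels come from
the way alloc_pages() is dispatched. Roughly, as a simplified sketch of the
include/linux/gfp.h logic of this era (not the exact kernel source):

    #ifdef CONFIG_NUMA
    /* NUMA: go through the mempolicy code via an out-of-line call, i.e.
     * alloc_pages() -> alloc_pages_current() ->
     * __alloc_pages_nodemask() -> get_page_from_freelist()
     */
    static inline struct page *
    alloc_pages(gfp_t gfp_mask, unsigned int order)
    {
            return alloc_pages_current(gfp_mask, order);
    }
    #else
    /* !NUMA: inlining collapses this to
     * __alloc_pages_nodemask() -> get_page_from_freelist()
     */
    #define alloc_pages(gfp_mask, order) \
            alloc_pages_node(numa_node_id(), gfp_mask, order)
    #endif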
--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer