From: Ni zhan Chen
Date: Tue, 16 Oct 2012 19:37:47 +0800
To: "Kirill A. Shutemov"
CC: "Kirill A. Shutemov", Andrew Morton, Andrea Arcangeli, linux-mm@kvack.org, Andi Kleen, "H. Peter Anvin", linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 00/10, REBASED] Introduce huge zero page
Message-ID: <507D470B.7050807@gmail.com>
In-Reply-To: <20121016112845.GA13540@shutemov.name>
References: <1350280859-18801-1-git-send-email-kirill.shutemov@linux.intel.com> <507D2E83.4010702@gmail.com> <20121016105456.GA13265@shutemov.name> <507D4143.3020108@gmail.com> <20121016112845.GA13540@shutemov.name>

On 10/16/2012 07:28 PM, Kirill A. Shutemov wrote:
> On Tue, Oct 16, 2012 at 07:13:07PM +0800, Ni zhan Chen wrote:
>> On 10/16/2012 06:54 PM, Kirill A. Shutemov wrote:
>>> On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:
>>>>> By hpa's request I've tried an alternative approach for the hzp
>>>>> implementation (see the virtual huge zero page patchset): a pmd table
>>>>> with all entries set to the zero page. This way should be more cache
>>>>> friendly, but it increases TLB pressure.
>>>> Thanks for your excellent work. But could you explain to me why the
>>>> current implementation is not cache friendly and hpa's proposal is
>>>> cache friendly? Thanks in advance.
>>> In workloads like microbenchmark1 you need N * size(zero page) of cache
>>> space to get the zero page fully cached, where N is the cache
>>> associativity. If the zero page is 2M, the cache pressure is significant.
>>>
>>> On the other hand, a table of 4k zero pages (hpa's proposal) will
>>> increase pressure on the TLB, since we have more pages for the same
>>> memory area. So we have to do more page translations in this case.
>>>
>>> On my test machine with a simple memcmp() the virtual huge zero page is
>>> faster. But it highly depends on TLB size, cache size, memory access and
>>> page translation costs.
>>>
>>> It looks like cache size in modern processors grows faster than TLB size.
>> Oh, I see, thanks for your quick response. One more question below.
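To make the cache-versus-TLB tradeoff described above concrete, here is a
rough back-of-the-envelope sketch. It is not part of the original thread;
the only input taken from it is the 8 GB region size used by the benchmarks
quoted below, everything else is plain arithmetic:

#include <stdio.h>

int main(void)
{
	unsigned long region = 8UL << 30;	/* 8 GB region, as in the tests below */

	/* hzp: one physical 2 MB zero page backs every huge PMD entry */
	printf("hzp:  distinct zero-page data %lu KB, mappings needed %lu\n",
	       (2UL << 20) >> 10, region / (2UL << 20));

	/* vhzp: one physical 4 KB zero page, mapped via ordinary PTEs */
	printf("vhzp: distinct zero-page data %lu KB, mappings needed %lu\n",
	       (4UL << 10) >> 10, region / (4UL << 10));

	return 0;
}

So reading the whole region pulls about 2 MB of zero-page data through the
caches with hzp but only 4 KB with vhzp, while vhzp needs roughly 512 times
as many TLB entries to map the same area; that is exactly the pattern
visible in the LLC-load and dTLB-miss counters of microbenchmark2 further
down.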
>>>>> The problem with the virtual huge zero page: it requires per-arch
>>>>> enabling. We need a way to mark that a pmd table has all ptes set to
>>>>> the zero page.
>>>>>
>>>>> Some numbers to compare the two implementations (on 4s Westmere-EX):
>>>>>
>>>>> Microbenchmark1
>>>>> ===============
>>>>>
>>>>> test:
>>>>>         posix_memalign((void **)&p, 2 * MB, 8 * GB);
>>>>>         for (i = 0; i < 100; i++) {
>>>>>                 assert(memcmp(p, p + 4*GB, 4*GB) == 0);
>>>>>                 asm volatile ("": : :"memory");
>>>>>         }
>>>>>
>>>>> hzp:
>>>>>  Performance counter stats for './test_memcmp' (5 runs):
>>>>>
>>>>>       32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
>>>>>                 40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
>>>>>                  0 CPU-migrations            #    0.000 K/sec
>>>>>              4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
>>>>>     76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
>>>>>     36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
>>>>>      1,684,049,110 stalled-cycles-backend    #    2.20% backend cycles idle      ( +-  2.96% ) [66.67%]
>>>>>    134,355,715,816 instructions              #    1.75  insns per cycle
>>>>>                                              #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
>>>>>     13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
>>>>>          1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]
>>>>>
>>>>>       32.413866442 seconds time elapsed                                          ( +-  0.13% )
>>>>>
>>>>> vhzp:
>>>>>  Performance counter stats for './test_memcmp' (5 runs):
>>>>>
>>>>>       30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
>>>>>                 38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
>>>>>                  0 CPU-migrations            #    0.000 K/sec
>>>>>              4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
>>>>>     71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
>>>>>     31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
>>>>>        773,484,474 stalled-cycles-backend    #    1.07% backend cycles idle      ( +-  6.61% ) [66.67%]
>>>>>    134,982,215,437 instructions              #    1.88  insns per cycle
>>>>>                                              #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
>>>>>     13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
>>>>>          1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]
>>>>>
>>>>>       30.381324695 seconds time elapsed                                          ( +-  0.13% )
>>>> Could you tell me which data I should look at in these performance
>>>> counter stats? And what's the benefit of your current implementation
>>>> compared to hpa's proposal?
>> Sorry for my ignorance. Could you tell me which data I should look at in
>> these performance counter stats? The same question about the second
>> benchmark's counter stats, thanks in advance. :-)
> I've missed the relevant counters in this run, you can see them in the
> second benchmark.
>
> Relevant counters:
> L1-dcache-*, LLC-*: show cache-related stats (hits/misses);
> dTLB-*: shows data TLB hits and misses.
>
> Indirectly relevant counters:
> stalled-cycles-*: how long the CPU pipeline has to wait for data.

Oh, I see, thanks for your patience. :-)
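For anyone who wants to reproduce microbenchmark1, a self-contained version
might look like the sketch below. The headers, the MB/GB macros, the
return-value check and the madvise() call are my additions (the cover
letter only shows the inner loop), and transparent hugepages have to be
enabled for the huge zero page to be used at all:

/* Self-contained sketch of microbenchmark1; assumes THP is enabled. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define MB (1UL << 20)
#define GB (1UL << 30)

int main(void)
{
	char *p;
	int i;

	assert(posix_memalign((void **)&p, 2 * MB, 8 * GB) == 0);
	madvise(p, 8 * GB, MADV_HUGEPAGE);	/* ask for huge pages (my assumption) */

	/* the region is never written, so every read hits the zero page */
	for (i = 0; i < 100; i++) {
		assert(memcmp(p, p + 4 * GB, 4 * GB) == 0);
		asm volatile ("": : :"memory");
	}
	return 0;
}

Running it under "perf stat -r 5" with the cache and TLB events Kirill
lists above should produce the same kind of numbers as the runs quoted
here.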
>
>>>>> Microbenchmark2
>>>>> ===============
>>>>>
>>>>> test:
>>>>>         posix_memalign((void **)&p, 2 * MB, 8 * GB);
>>>>>         for (i = 0; i < 1000; i++) {
>>>>>                 char *_p = p;
>>>>>                 while (_p < p+4*GB) {
>>>>>                         assert(*_p == *(_p+4*GB));
>>>>>                         _p += 4096;
>>>>>                         asm volatile ("": : :"memory");
>>>>>                 }
>>>>>         }
>>>>>
>>>>> hzp:
>>>>>  Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>>>>>
>>>>>        3505.727639 task-clock                #    0.998 CPUs utilized            ( +-  0.26% )
>>>>>                  9 context-switches          #    0.003 K/sec                    ( +-  4.97% )
>>>>>              4,384 page-faults               #    0.001 M/sec                    ( +-  0.00% )
>>>>>      8,318,482,466 cycles                    #    2.373 GHz                      ( +-  0.26% ) [33.31%]
>>>>>      5,134,318,786 stalled-cycles-frontend   #   61.72% frontend cycles idle     ( +-  0.42% ) [33.32%]
>>>>>      2,193,266,208 stalled-cycles-backend    #   26.37% backend cycles idle      ( +-  5.51% ) [33.33%]
>>>>>      9,494,670,537 instructions              #    1.14  insns per cycle
>>>>>                                              #    0.54  stalled cycles per insn  ( +-  0.13% ) [41.68%]
>>>>>      2,108,522,738 branches                  #  601.451 M/sec                    ( +-  0.09% ) [41.68%]
>>>>>            158,746 branch-misses             #    0.01% of all branches          ( +-  1.60% ) [41.71%]
>>>>>      3,168,102,115 L1-dcache-loads           #  903.693 M/sec                    ( +-  0.11% ) [41.70%]
>>>>>      1,048,710,998 L1-dcache-misses          #   33.10% of all L1-dcache hits    ( +-  0.11% ) [41.72%]
>>>>>      1,047,699,685 LLC-load                  #  298.854 M/sec                    ( +-  0.03% ) [33.38%]
>>>>>              2,287 LLC-misses                #    0.00% of all LL-cache hits     ( +-  8.27% ) [33.37%]
>>>>>      3,166,187,367 dTLB-loads                #  903.147 M/sec                    ( +-  0.02% ) [33.35%]
>>>>>          4,266,538 dTLB-misses               #    0.13% of all dTLB cache hits   ( +-  0.03% ) [33.33%]
>>>>>
>>>>>        3.513339813 seconds time elapsed                                          ( +-  0.26% )
>>>>>
>>>>> vhzp:
>>>>>  Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>>>>>
>>>>>       27313.891128 task-clock                #    0.998 CPUs utilized            ( +-  0.24% )
>>>>>                 62 context-switches          #    0.002 K/sec                    ( +-  0.61% )
>>>>>              4,384 page-faults               #    0.160 K/sec                    ( +-  0.01% )
>>>>>     64,747,374,606 cycles                    #    2.370 GHz                      ( +-  0.24% ) [33.33%]
>>>>>     61,341,580,278 stalled-cycles-frontend   #   94.74% frontend cycles idle     ( +-  0.26% ) [33.33%]
>>>>>     56,702,237,511 stalled-cycles-backend    #   87.57% backend cycles idle      ( +-  0.07% ) [33.33%]
>>>>>     10,033,724,846 instructions              #    0.15  insns per cycle
>>>>>                                              #    6.11  stalled cycles per insn  ( +-  0.09% ) [41.65%]
>>>>>      2,190,424,932 branches                  #   80.195 M/sec                    ( +-  0.12% ) [41.66%]
>>>>>          1,028,630 branch-misses             #    0.05% of all branches          ( +-  1.50% ) [41.66%]
>>>>>      3,302,006,540 L1-dcache-loads           #  120.891 M/sec                    ( +-  0.11% ) [41.68%]
>>>>>        271,374,358 L1-dcache-misses          #    8.22% of all L1-dcache hits    ( +-  0.04% ) [41.66%]
>>>>>         20,385,476 LLC-load                  #    0.746 M/sec                    ( +-  1.64% ) [33.34%]
>>>>>             76,754 LLC-misses                #    0.38% of all LL-cache hits     ( +-  2.35% ) [33.34%]
>>>>>      3,309,927,290 dTLB-loads                #  121.181 M/sec                    ( +-  0.03% ) [33.34%]
>>>>>      2,098,967,427 dTLB-misses               #   63.41% of all dTLB cache hits   ( +-  0.03% ) [33.34%]
>>>>>
>>>>>       27.364448741 seconds time elapsed                                          ( +-  0.24% )
>>>> For this case, the same question as above, thanks in advance. :-)
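For completeness, the same treatment applied to microbenchmark2 (the
per-page walk that stresses the TLB rather than memory bandwidth) might
look like the sketch below; as above, the headers, the MB/GB macros and
the madvise() call are assumptions on my part rather than part of the
original test:

/* Self-contained sketch of microbenchmark2; assumes THP is enabled. */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>

#define MB (1UL << 20)
#define GB (1UL << 30)

int main(void)
{
	char *p;
	int i;

	assert(posix_memalign((void **)&p, 2 * MB, 8 * GB) == 0);
	madvise(p, 8 * GB, MADV_HUGEPAGE);	/* ask for huge pages (my assumption) */

	/* touch one byte per 4k page, forcing a translation per access */
	for (i = 0; i < 1000; i++) {
		char *_p = p;

		while (_p < p + 4 * GB) {
			assert(*_p == *(_p + 4 * GB));
			_p += 4096;
			asm volatile ("": : :"memory");
		}
	}
	return 0;
}

The quoted vhzp run shows the consequence of this access pattern:
L1-dcache-misses drop to about 8% because only one 4k page of data is ever
read, while dTLB-misses climb to about 63% because every 4k step needs its
own translation.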