Date: Thu, 19 Jan 2023 15:53:32 +0000
Subject: Re: [PATCH v3 0/7] perf cs_etm: Basic support for virtual/kernel timestamps
From: James Clark
To: Tanmay Jagdale
Cc: Sunil Kovvuri Goutham, George Cherian, Bharat Bhushan, Mathieu Poirier,
 John Garry, Will Deacon, Peter Zijlstra, Ingo Molnar,
 Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin, Jiri Olsa,
 Namhyung Kim, coresight@lists.linaro.org, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 leo.yan@linaro.org, mike.leach@linaro.org
Message-ID: <5fc1e764-bafc-34e5-5ff9-8b62af99ac10@arm.com>
In-Reply-To: <70a94ad3-fe5d-b013-7f6b-dd83d2332e0e@arm.com>
References: <20230103162042.423694-1-james.clark@arm.com>
 <5acce414-eabb-3d22-4907-da6b64b85a9c@arm.com>
 <70a94ad3-fe5d-b013-7f6b-dd83d2332e0e@arm.com>

On 13/01/2023 11:12, James Clark wrote:
>
>
> On 12/01/2023 15:33, Tanmay Jagdale wrote:
>> Hi James
>>
>>>>>> base-commit: 09e6f9f98370be9a9f8978139e0eb1be87d1125f
>>>>> I have tested this patch set on our platform and am able to see
>>>>> updated timestamp values in perf script's output.
>>>>>
>>>>> $ perf record -e cs_etm/cycacc,@tmc_etr0/k -C 9 taskset -c 9 sleep 2
>>>>> $ perf script --itrace=i1ns --ns -Fcomm,tid,pid,time,cpu,event,ip,sym,addr,symoff,flags,callindent
>>>>>
>>>>> At certain points we noticed that the hardware emits timestamp packets
>>>>> with the same value but with updated cycle count (CC) values.
>>>>> A small snippet of the log:
>>>>>
>>>>> Idx:100; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086A761C ~[0x761C]
>>>>> Idx:103; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3d
>>>>> Idx:107; ID:12; I_ATOM_F2 : Atom format 2.; EN
>>>>> Idx:108; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x3f
>>>>> Idx:112; ID:12; I_ATOM_F1 : Atom format 1.; N
>>>>> Idx:113; ID:12; I_TIMESTAMP : Timestamp.; Updated val = 0x2f373e37e02; CC=0x45
>>>>> Idx:116; ID:12; I_ATOM_F1 : Atom format 1.; E
>>>>> Idx:117; ID:12; I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFF8000086B52D4 ~[0x152D4]
>>>>>
>>>>> Since the source of the timestamp is the Generic Timer block and the
>>>>> CPUs run at higher frequencies, this behaviour could be possible on
>>>>> high-performance Arm cores.
>>>>>
>>>>> Having consecutive timestamps with the same value results in a
>>>>> slightly jumbled order (in nanoseconds) in perf script's time column.
>>>>> A snippet corresponding to the Coresight trace data mentioned above:
>>>>> ...
>>>>> perf 965/965 [001] 3182.286629044: instructions:k: return 0 ffff8000086a761c coresight_timeout+0xc8
>>>>> perf 965/965 [001] 3182.286629044: instructions:k: return 0 ffff8000086a7620 coresight_timeout+0xcc
>>>>> perf 965/965 [001] 3182.286629046: instructions:k: jmp    0 ffff8000086a75c8 coresight_timeout+0x74
>>>>> perf 965/965 [001] 3182.286629046: instructions:k: jmp    0 ffff8000086a75cc coresight_timeout+0x78
>>>>> perf 965/965 [001] 3182.286629044: instructions:k: jcc    0 ffff8000086a75d0 coresight_timeout+0x7c
>>>>> perf 965/965 [001] 3182.286629044: instructions:k: jcc    0 ffff8000086a75d4 coresight_timeout+0x80
>>>>> perf 965/965 [001] 3182.286629044: instructions:k: jcc    0 ffff8000086a75d8 coresight_timeout+0x84
>>>>> perf 965/965 [001] 3182.286629044: instructions:k: jcc    0 ffff8000086a75dc coresight_timeout+0x88
>>>>> perf 965/965 [001] 3182.286629044: instructions:k: jcc    0 ffff8000086a75e0 coresight_timeout+0x8c
>>>>> perf 965/965 [001] 3182.286629044: instructions:k: jcc    0 ffff8000086a75e4 coresight_timeout+0x90
>>>>> perf 965/965 [001] 3182.286629044: instructions:k: jcc    0 ffff8000086a75e8 coresight_timeout+0x94
>>>>> perf 965/965 [001] 3182.286629044: instructions:k: jcc    0 ffff8000086a75ec coresight_timeout+0x98
>>>>>
>>>>> Perf's do_soft_timestamp() logic in cs_etm_decoder.c increments the
>>>>> HW timestamp based on the instruction count. Since the next timestamp
>>>>> also has the same value, it could be leading to this jumbled order.
>>>>>
>>>>> We would like to know if this has been observed on other platforms,
>>>>> and what could be a solution in SW for this?
>>>>
>>>> Nice catch. If I'm understanding this correctly, it looks like the issue
>>>> is a combination of the cycle count in the packet being ignored by Perf,
>>>> and the instruction count being reset to 0 when a new timestamp is received.
>>>>
>>>> It looks like it can be fixed by some combination of combining the cycle
>>>> count and instruction count, and maybe not resetting the instruction count
>>>> if the newly received timestamp is the same as the last one. I will look
>>>> into this.
>>>>
>>>> We haven't noticed it on any other platforms, but we haven't been
>>>> looking too closely at the timestamps until now. Perhaps I can add a
>>>> test that checks whether the trace in a known function goes in the
>>>> correct time order.
>>>>
>>>
>>> I'm thinking of something like the following patch to fix the ordering.
>>> It doesn't use the cycle count, but I'm not sure if that would be
>>> worthwhile in the end, considering that it would have sub-nanosecond
>>> resolution so wouldn't affect the Perf timestamps:
>> Thanks for coming up with a quick solution for this issue!
>>
>>>
>>>
>>> diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
>>> index 31fa3b45134a..08a028e3e87a 100644
>>> --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
>>> +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
>>> @@ -112,6 +112,19 @@ int cs_etm_decoder__get_packet(struct cs_etm_packet_queue *packet_queue,
>>>  	return 1;
>>>  }
>>>
>>> +static u64 cs_etm_decoder__instr_count_to_ns(u64 instr_count)
>>> +{
>>> +	/*
>>> +	 * Assume a maximum of 0.1ns elapsed per instruction. This would be the
>>> +	 * case with a theoretical 10GHz core executing 1 instruction per cycle.
>>> +	 * Used to estimate the sample time for synthesized instructions because
>>> +	 * Coresight only emits a timestamp for a range of instructions rather
>>> +	 * than per instruction.
>>> +	 */
>>> +	const int INSTR_PER_NS = 10;
>>> +
>>> +	return instr_count / INSTR_PER_NS;
>>> +}
>>>  static int cs_etm_decoder__gen_etmv3_config(struct cs_etm_trace_params *params,
>>>  					    ocsd_etmv3_cfg *config)
>>>  {
>>> @@ -267,7 +280,7 @@ cs_etm_decoder__do_soft_timestamp(struct cs_etm_queue *etmq,
>>>  	packet_queue->cs_timestamp = packet_queue->next_cs_timestamp;
>>>
>>>  	/* Estimate the timestamp for the next range packet */
>>> -	packet_queue->next_cs_timestamp += packet_queue->instr_count;
>>> +	packet_queue->next_cs_timestamp += cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count);
>>>  	packet_queue->instr_count = 0;
>>>
>>>  	/* Tell the front end which traceid_queue needs attention */
>>> @@ -295,11 +308,17 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
>>>  	 * hence asking the decoder to keep decoding rather than stopping.
>>>  	 */
>>>  	if (packet_queue->cs_timestamp) {
>>> -		packet_queue->next_cs_timestamp = elem->timestamp;
>>> +		/*
>>> +		 * Don't allow next_cs_timestamp to be less than the last one estimated by
>>> +		 * cs_etm_decoder__do_soft_timestamp() otherwise new instructions would
>>> +		 * appear to go back in time. In theory this should never happen, but if
>>> +		 * it did, then next_cs_timestamp should eventually catch up to real time
>>> +		 * unless every single range was predicted to be too long for some reason.
>>> +		 */
>>> +		packet_queue->next_cs_timestamp = max(elem->timestamp, packet_queue->next_cs_timestamp);
>>>  		return OCSD_RESP_CONT;
>>>  	}
>>>
>>> -
>>>  	if (!elem->timestamp) {
>>>  		/*
>>>  		 * Zero timestamps can be seen due to misconfiguration or hardware bugs.
>>> @@ -312,7 +331,7 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
>>>  			     ". Decoding may be improved by prepending 'Z' to your current --itrace arguments.\n",
>>>  			     indx);
>>>
>>> -	} else if (packet_queue->instr_count > elem->timestamp) {
>>> +	} else if (cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count) > elem->timestamp) {
>>>  		/*
>>>  		 * Sanity check that the elem->timestamp - packet_queue->instr_count would not
>>>  		 * result in an underflow. Warn and clamp at 0 if it would.
>>> @@ -327,7 +346,8 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
>>>  		 * which instructions started by subtracting the number of instructions
>>>  		 * executed to the timestamp.
>>>  		 */
>>> -		packet_queue->cs_timestamp = elem->timestamp - packet_queue->instr_count;
>>> +		packet_queue->cs_timestamp = elem->timestamp -
>>> +			cs_etm_decoder__instr_count_to_ns(packet_queue->instr_count);
>>>  	}
>>>  	packet_queue->next_cs_timestamp = elem->timestamp;
>>>  	packet_queue->instr_count = 0;
>> I have tested this diff along with the patch set and found that
>> timestamps are generated/estimated in increasing order.
>>
>> However, I found a few corner cases where they weren't in order.
>> I'm currently gathering more information on these corner cases.
>>
>> I used the following steps to find inconsistencies:
>> $ perf record -e cs_etm/@tmc_etr0/k -C 4 taskset -c 4 sleep 1
>> $ perf script --itrace=i1ns --ns -Fcomm,tid,pid,time,cpu,event,ip,sym,addr,symoff,flags,callindent > itrace
>> $ sed 's/://g' itrace | awk -F '.' ' { print $2 } ' | awk '{ if ($1 < prev) { print "line:" NR " " $0 } {prev=$1}}'
>>
>> Setting INSTR_PER_NS to higher values (14, 16, 20, etc.) results in
>> fewer inconsistencies, but it would defeat the purpose of
>> estimating values in software since we could be returning 0
>> from __instr_count_to_ns() in many cases.
>>
>> Please share your thoughts.
>
> I believe this could be because I was adding nanosecond values
> to the raw Coresight counter before it was converted, which was
> a mistake. I have made another change to move the conversion
> as early as possible to avoid mistakes like this.
>
> Increasing INSTR_PER_NS shouldn't be required because of
> the max() call; in fact INSTR_PER_NS isn't necessarily
> needed at all, I just thought it would make it more accurate.
>
> I'll test the new version on the file that you sent before
> pasting it here. If you have another recording that has more
> of the edge cases then please upload that one too.

Hi Tanmay,

I've sent v4 and I don't see any issues now with the files that you sent.
There were a couple of different edge cases that I listed in the last
commit message.

Let me know if you find any more; otherwise, if you could leave your
review or tested-by tag that would be great.

Thanks
James
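
For reference, the estimation scheme discussed in the patch above can be
sketched standalone roughly as follows. This is a minimal illustration, not
the actual tools/perf code: the struct and function names are simplified
stand-ins, and only the two ideas under discussion are shown (converting the
instruction count to nanoseconds before adding it to an ns-domain timestamp,
and clamping the estimate so it never runs backwards when consecutive
hardware timestamps repeat, as in the log above).

#include <stdint.h>

/* Assume at most ~0.1 ns per instruction (a theoretical 10 GHz core at 1 IPC). */
#define INSTR_PER_NS 10

struct ts_queue {
	uint64_t cs_timestamp;      /* timestamp attached to the current range */
	uint64_t next_cs_timestamp; /* estimate for the next range packet */
	uint64_t instr_count;       /* instructions seen since the last timestamp */
};

static uint64_t instr_count_to_ns(uint64_t instr_count)
{
	return instr_count / INSTR_PER_NS;
}

/*
 * Soft timestamp: no hardware timestamp arrived with this range, so advance
 * the estimate by the (converted) number of instructions executed since the
 * last one instead of adding the raw instruction count.
 */
static void soft_timestamp(struct ts_queue *q)
{
	q->cs_timestamp = q->next_cs_timestamp;
	q->next_cs_timestamp += instr_count_to_ns(q->instr_count);
	q->instr_count = 0;
}

/*
 * Hard timestamp: a hardware timestamp (already converted to ns) arrived.
 * Never let the estimate move backwards if the hardware repeats a value.
 * The real decoder also handles the very first timestamp and zero
 * timestamps; only the clamping path is shown here.
 */
static void hard_timestamp(struct ts_queue *q, uint64_t hw_timestamp_ns)
{
	if (hw_timestamp_ns > q->next_cs_timestamp)
		q->next_cs_timestamp = hw_timestamp_ns;
}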