linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] x86: Add xsave/xrstor Latency Trace Events and Consuming Script
@ 2022-07-23  8:37 Yi Sun
  2022-07-23  8:37 ` [PATCH 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR Yi Sun
  2022-07-23  8:38 ` [PATCH 2/2] scripts/x86: Add script to consume trace log of xsave latency Yi Sun
  0 siblings, 2 replies; 10+ messages in thread
From: Yi Sun @ 2022-07-23  8:37 UTC (permalink / raw)
  To: dave.hansen, linux-kernel, x86; +Cc: sohil.mehta, tony.luck, heng.su, yi.sun

This series introduces trace events which dump the latency of the
xsave and xrstor instructions on x86 platforms. In internal testing
we compared several candidate latency calculation methods and believe
this approach is the most accurate way to obtain the latency.

The series also introduces a script to consume the trace log. It
leverages sqlite3 to report statistics such as the minimum, maximum,
average and 97% tail latency.
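
Once a trace log has been captured, the script is invoked on it
directly, for example (the log file name here is arbitrary):

  ./scripts/x86/xsave-latency-trace.sh trace-log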

This series has gone through several rounds of internal peer review
before submission. All patches carry review tags, and the series is
ready for review by the x86@kernel.org maintainers.

Yi Sun (2):
  x86/fpu: Measure the Latency of XSAVE and XRSTOR
  scripts/x86: Add script to consume trace log of xsave latency

 arch/x86/include/asm/trace/fpu.h   |  35 +++++

 arch/x86/kernel/fpu/xstate.h       |  49 ++++++-
 scripts/x86/xsave-latency-trace.sh | 227 +++++++++++++++++++++++++++++
 3 files changed, 306 insertions(+), 5 deletions(-)
 create mode 100755 scripts/x86/xsave-latency-trace.sh

-- 
2.34.1



* [PATCH 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR
  2022-07-23  8:37 [PATCH 0/2] x86: Add xsave/xrstor Latency Trace Events and Consuming Script Yi Sun
@ 2022-07-23  8:37 ` Yi Sun
  2022-07-24 20:54   ` David Laight
  2022-07-28 18:46   ` Sohil Mehta
  2022-07-23  8:38 ` [PATCH 2/2] scripts/x86: Add script to consume trace log of xsave latency Yi Sun
  1 sibling, 2 replies; 10+ messages in thread
From: Yi Sun @ 2022-07-23  8:37 UTC (permalink / raw)
  To: dave.hansen, linux-kernel, x86; +Cc: sohil.mehta, tony.luck, heng.su, yi.sun

Calculate the latency of instructions xsave and xrstor with new trace
points x86_fpu_latency_xsave and x86_fpu_latency_xrstor.

The delta TSC can be calculated within a single trace event. Another
option considered was to have 2 separated trace events marking the
start and finish of the xsave/xrstor instructions. The delta TSC was
calculated from the 2 trace points in user space, but there was
significant overhead added by the trace function itself.

In internal testing, the single trace point option which is
implemented here proved to be more accurate.

Together with the latency, dump the RFBM (requested-feature bitmap)
and XINUSE to identify the type of xsave/xrstor.

The trace log looks like the following:
  x86_fpu_latency_xsave: x86/fpu: latency:100 RFBM:0x202e7 XINUSE:0x202
  x86_fpu_latency_xrstor: x86/fpu: latency:99 RFBM:0x202e7 XINUSE:0x202

To enable it, CONFIG_X86_DEBUG_FPU and CONFIG_TRACEPOINTS are required.
The compiler can get rid of all the extra crust when either of the two
configs is disabled.

If both of the two configs are enabled, xsave/xrstor_tracing_enabled
would be reduced to a static check for tracing enabled. Thus, in the
fast path there would be only 2 additional static checks.
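
With both configs enabled, the events can then be switched on at run
time like any other tracepoint, for example:

  echo 1 > /sys/kernel/debug/tracing/events/x86_fpu/enable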

Since tracepoints can be enabled dynamically, the trace_event could be
enabled concurrently while the code is checking
tracepoint_enabled(trace_event). Hence there is a small chance of
getting a single noisy result 'trace_clock() - (-1)' at the moment the
trace points x86_fpu_latency_* are enabled. Leave the noise here instead of additional
conditions before calling the x86_fpu_latency_* because that makes the
latency more accurate and it's easy to filer the noise out by the
following consuming script.

Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Yi Sun <yi.sun@intel.com>

diff --git a/arch/x86/include/asm/trace/fpu.h b/arch/x86/include/asm/trace/fpu.h
index 4645a6334063..b40328f4e90d 100644
--- a/arch/x86/include/asm/trace/fpu.h
+++ b/arch/x86/include/asm/trace/fpu.h
@@ -89,6 +89,41 @@ DEFINE_EVENT(x86_fpu, x86_fpu_xstate_check_failed,
 	TP_ARGS(fpu)
 );
 
+DECLARE_EVENT_CLASS(x86_fpu_latency,
+	TP_PROTO(struct fpstate *fpstate, u64 dtsc),
+	TP_ARGS(fpstate, dtsc),
+
+	TP_STRUCT__entry(
+		__field(struct fpstate *, fpstate)
+		__field(u64, dtsc)
+		__field(u64, rfbm)
+		__field(u64, xinuse)
+	),
+
+	TP_fast_assign(
+		__entry->fpstate = fpstate;
+		__entry->dtsc = dtsc;
+		__entry->rfbm = fpstate->xfeatures;
+		__entry->xinuse = fpstate->regs.xsave.header.xfeatures;
+	),
+
+	TP_printk("x86/fpu: latency:%lld RFBM:0x%llx XINUSE:0x%llx",
+		__entry->dtsc,
+		__entry->rfbm,
+		__entry->xinuse
+	)
+);
+
+DEFINE_EVENT(x86_fpu_latency, x86_fpu_latency_xsave,
+	TP_PROTO(struct fpstate *fpstate, u64 dtsc),
+	TP_ARGS(fpstate, dtsc)
+);
+
+DEFINE_EVENT(x86_fpu_latency, x86_fpu_latency_xrstor,
+	TP_PROTO(struct fpstate *fpstate, u64 dtsc),
+	TP_ARGS(fpstate, dtsc)
+);
+
 #undef TRACE_INCLUDE_PATH
 #define TRACE_INCLUDE_PATH asm/trace/
 #undef TRACE_INCLUDE_FILE
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index d22ace092ca2..c5d294e5ee52 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -5,6 +5,9 @@
 #include <asm/cpufeature.h>
 #include <asm/fpu/xstate.h>
 #include <asm/fpu/xcr.h>
+#include <asm/trace/fpu.h>
+
+#include <linux/trace_clock.h>
 
 #ifdef CONFIG_X86_64
 DECLARE_PER_CPU(u64, xfd_state);
@@ -68,6 +71,20 @@ static inline u64 xfeatures_mask_independent(void)
 	return XFEATURE_MASK_INDEPENDENT;
 }
 
+static inline bool xsave_tracing_enabled(void)
+{
+	if (!IS_ENABLED(CONFIG_X86_DEBUG_FPU))
+		return false;
+	return tracepoint_enabled(x86_fpu_latency_xsave);
+}
+
+static inline bool xrstor_tracing_enabled(void)
+{
+	if (!IS_ENABLED(CONFIG_X86_DEBUG_FPU))
+		return false;
+	return tracepoint_enabled(x86_fpu_latency_xrstor);
+}
+
 /* XSAVE/XRSTOR wrapper functions */
 
 #ifdef CONFIG_X86_64
@@ -110,7 +127,7 @@ static inline u64 xfeatures_mask_independent(void)
  * original instruction which gets replaced. We need to use it here as the
  * address of the instruction where we might get an exception at.
  */
-#define XSTATE_XSAVE(st, lmask, hmask, err)				\
+#define __XSTATE_XSAVE(st, lmask, hmask, err)				\
 	asm volatile(ALTERNATIVE_2(XSAVE,				\
 				   XSAVEOPT, X86_FEATURE_XSAVEOPT,	\
 				   XSAVES,   X86_FEATURE_XSAVES)	\
@@ -122,11 +139,22 @@ static inline u64 xfeatures_mask_independent(void)
 		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
 		     : "memory")
 
+#define XSTATE_XSAVE(fps, lmask, hmask, err)				\
+	do {								\
+		struct fpstate *f = fps;				\
+		u64 tc = -1;						\
+		if (xsave_tracing_enabled())				\
+			tc = trace_clock();				\
+		__XSTATE_XSAVE(&f->regs.xsave, lmask, hmask, err);	\
+		if (xsave_tracing_enabled())				\
+			trace_x86_fpu_latency_xsave(f, trace_clock() - tc);\
+	} while (0)
+
 /*
  * Use XRSTORS to restore context if it is enabled. XRSTORS supports compact
  * XSAVE area format.
  */
-#define XSTATE_XRESTORE(st, lmask, hmask)				\
+#define __XSTATE_XRESTORE(st, lmask, hmask)				\
 	asm volatile(ALTERNATIVE(XRSTOR,				\
 				 XRSTORS, X86_FEATURE_XSAVES)		\
 		     "\n"						\
@@ -136,6 +164,17 @@ static inline u64 xfeatures_mask_independent(void)
 		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
 		     : "memory")
 
+#define XSTATE_XRESTORE(fps, lmask, hmask)				\
+	do {								\
+		struct fpstate *f = fps;				\
+		u64 tc = -1;						\
+		if (xrstor_tracing_enabled())				\
+			tc = trace_clock();				\
+		__XSTATE_XRESTORE(&f->regs.xsave, lmask, hmask);	\
+		if (xrstor_tracing_enabled())				\
+			trace_x86_fpu_latency_xrstor(f, trace_clock() - tc);\
+	} while (0)
+
 #if defined(CONFIG_X86_64) && defined(CONFIG_X86_DEBUG_FPU)
 extern void xfd_validate_state(struct fpstate *fpstate, u64 mask, bool rstor);
 #else
@@ -180,7 +219,7 @@ static inline void os_xsave(struct fpstate *fpstate)
 	WARN_ON_FPU(!alternatives_patched);
 	xfd_validate_state(fpstate, mask, false);
 
-	XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
+	XSTATE_XSAVE(fpstate, lmask, hmask, err);
 
 	/* We should never fault when copying to a kernel buffer: */
 	WARN_ON_FPU(err);
@@ -197,7 +236,7 @@ static inline void os_xrstor(struct fpstate *fpstate, u64 mask)
 	u32 hmask = mask >> 32;
 
 	xfd_validate_state(fpstate, mask, true);
-	XSTATE_XRESTORE(&fpstate->regs.xsave, lmask, hmask);
+	XSTATE_XRESTORE(fpstate, lmask, hmask);
 }
 
 /* Restore of supervisor state. Does not require XFD */
@@ -207,7 +246,7 @@ static inline void os_xrstor_supervisor(struct fpstate *fpstate)
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 
-	XSTATE_XRESTORE(&fpstate->regs.xsave, lmask, hmask);
+	XSTATE_XRESTORE(fpstate, lmask, hmask);
 }
 
 /*
-- 
2.34.1



* [PATCH 2/2] scripts/x86: Add script to consume trace log of xsave latency
  2022-07-23  8:37 [PATCH 0/2] x86: Add xsave/xrstor Latency Trace Events and Consuming Script Yi Sun
  2022-07-23  8:37 ` [PATCH 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR Yi Sun
@ 2022-07-23  8:38 ` Yi Sun
  1 sibling, 0 replies; 10+ messages in thread
From: Yi Sun @ 2022-07-23  8:38 UTC (permalink / raw)
  To: dave.hansen, linux-kernel, x86; +Cc: sohil.mehta, tony.luck, heng.su, yi.sun

Consume the trace log dumped by the tracepoints x86_fpu_latency_xsave
and x86_fpu_latency_xrstor, and calculate latency statistics for each
RFBM and XINUSE combination, including min, max, average and 97% tail
latency.

Add the average of the 97% tail latency to filter out the unreasonably
large latency values introduced by interrupts. In an experiment that
disabled interrupts around the calculation of the TSC delta, it was
clear that the top 3% of samples were the ones being filtered out.
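
Conceptually, the 97% tail average boils down to the sketch below
(shown for the xsave event only and ignoring the per-RFBM/XINUSE
grouping; the script itself builds per-combination temporary tables
and rounds the 97% cut-off):

  # number of samples to keep, i.e. 97% of the total
  n=$(sqlite3 db_trace.db "SELECT count(*) * 97 / 100 FROM t_trace
        WHERE event_name = 'x86_fpu_latency_xsave';")
  # average latency over the fastest 97% of the samples
  sqlite3 db_trace.db "SELECT avg(lat) FROM (
        SELECT lat FROM t_trace
        WHERE event_name = 'x86_fpu_latency_xsave'
        ORDER BY lat ASC LIMIT $n);"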

Make use of sqlite3 to keep the data statistics efficient and concise.
The output looks like the following:

EVENTs                	RFBM   	XINUSE	lat_min	lat_max	lat_avg	lat_avg(97%)
----------------------	-------	------	-------	-------	-------	------------
x86_fpu_latency_xrstor	0x206e7	0x0   	364    	364    	364    	364
x86_fpu_latency_xrstor	0x206e7	0x202 	112    	1152   	300    	276
x86_fpu_latency_xsave 	0x206e7	0x202 	80     	278    	141    	137
x86_fpu_latency_xsave 	0x206e7	0x246 	108    	234    	180    	177

The xsave latency trace log can be obtained in two ways:
1. Generated via the kernel debugfs:
   echo 1 > /sys/kernel/debug/tracing/events/x86_fpu/enable
   cat /sys/kernel/debug/tracing/trace_pipe > trace-log

2. Generated by a helper tool such as 'trace-cmd':
   trace-cmd record -e x86_fpu -F <command>
   trace-cmd report > trace-log

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Yi Sun <yi.sun@intel.com>

diff --git a/scripts/x86/xsave-latency-trace.sh b/scripts/x86/xsave-latency-trace.sh
new file mode 100755
index 000000000000..d45563984fd6
--- /dev/null
+++ b/scripts/x86/xsave-latency-trace.sh
@@ -0,0 +1,227 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# (c) 2022 Yi Sun <yi.sun@intel.com>
+
+trace_log=$1
+trace_lat_log=".trace_lat_log"
+db_name="db_trace"
+db_file="${db_name}.db"
+table_raw="t_trace"
+table_tail="t_trace_tail"
+table_results="t_results"
+events="x86_fpu_latency_xsave|x86_fpu_latency_xrstor"
+
+# The regex for the trace log. The rough pattern:
+# (proc) (No.cpu) (flags) (timestamp): (tracepoint): latency:(123) RFBM:0x(123) XINUSE:0x(123)$
+# Fold the regex into 3 parts making it easier to read.
+regex1="([^\ ]*)[[:space:]]*\[([0-9]+)\][[:space:]]*(.....\ )?[[:space:]]*"
+regex2="([0-9.]*):[[:space:]]*([^\ :]*):.*latency:([0-9]*)[[:space:]]*"
+regex3="RFBM:(0x[0-9a-f]*)[[:space:]]*XINUSE:(0x[0-9a-f]*)$"
+
+function usage() {
+	echo "This script consumes the tracepoint data, and dumps out the"
+	echo "latency ranges for each RFBM combination."
+	echo "Usage:"
+	echo "$0 <trace-log>"
+	echo "   trace-log:"
+	echo "     Either generated by Kernel sysfs:"
+	echo "     Either generated by kernel debugfs:"
+	echo "       cat /sys/kernel/debug/tracing/trace_pipe > trace-log"
+	echo ""
+	echo "     Or generated by a helper tool like 'trace-cmd':"
+	echo "       trace-cmd record -e x86_fpu"
+	echo "       trace-cmd report > trace-log"
+}
+
+# Check the dependent tools
+# ${@}: a list of third-party tools
+function check_packages() {
+	for pack in "$@"; do
+		which $pack >& /dev/null
+		if [[ $? != 0 ]]; then
+			echo "Please install $pack before running this script."
+			exit 1
+		fi
+	done
+}
+
+# Run SQL command with sqlite3
+# ${*}: SQL command fed to sqlite3
+function SQL_CMD() {
+	sqlite3 $db_file "$*"
+}
+
+# Run SQL command with sqlite3 and format the output with headers and column.
+# ${*}: SQL command fed to sqlite3
+function SQL_CMD_HEADER() {
+	sqlite3 -column -header $db_file "$*"
+}
+
+# Create a table in the DB
+# ${1}: name of the table
+function create_table() {
+	if [[ "$1" == "" ]]; then
+		echo "Empty table name!"
+		exit 1
+	fi
+	SQL_CMD "create table $1 (
+		id INTEGER PRIMARY KEY AUTOINCREMENT,
+		process TEXT,
+		cpu INT,
+		timestamp FLOAT,
+		event_name TEXT,
+		lat INT,
+		RFBM INT,
+		XINUSE INT);"
+}
+
+# Round to the nearest whole number
+# ${1}: a float number
+# Output: integer
+function round() {
+	echo "scale=0; ($1+0.5)/1" | bc
+}
+
+# Insert a record in the trace table
+#
+# process cpu  timestamp  event_name  lat  RFBM  XINUSE
+# $2      $3   $4         $5          $6   $7    $8
+
+function insert_line() {
+	if [[ "$1" == "" ]]; then
+		echo "Empty table name!"
+		exit 1
+	fi
+	SQL_CMD "INSERT INTO $1 (process, cpu, timestamp, event_name, lat, RFBM, XINUSE)
+		VALUES (\"$2\", $3, $4, \"$5\", $6, $7, $8);"
+}
+
+# Show the results of the trace statistics
+function get_latency_stat() {
+	SQL_CMD "create table $table_results (
+		id INTEGER PRIMARY KEY AUTOINCREMENT,
+		event_name TEXT,
+		RFBM INT,
+		XINUSE INT,
+		lat_min INT,
+		lat_max INT,
+		lat_avg INT,
+		lat_tail_avg INT);"
+
+	for((i=0; i<$cnt; i++));do
+		event_name=`get_comb_item $table_raw $i event_name`
+		RFBM=`get_comb_item $table_raw $i RFBM`
+		XINUSE=`get_comb_item $table_raw $i XINUSE`
+		lat_min=`get_comb_item $table_raw $i min\(lat\)`
+		lat_max=`get_comb_item $table_raw $i max\(lat\)`
+		lat_avg=`get_comb_item $table_raw $i avg\(lat\)`
+		lat_tail_avg=`get_comb_item $table_tail $i avg\(lat\)`
+
+		lat_avg=`round $lat_avg`
+		lat_tail_avg=`round $lat_tail_avg`
+
+		SQL_CMD "INSERT INTO $table_results
+			(event_name, RFBM,XINUSE, lat_min, lat_max, lat_avg, lat_tail_avg)
+			VALUES (\"$event_name\", $RFBM, $XINUSE, $lat_min, $lat_max,
+			$lat_avg, $lat_tail_avg);"
+	done
+
+	SQL_CMD_HEADER "select event_name[EVENTs],printf('0x%x',RFBM)[RFBM],
+			printf('0x%x',XINUSE)[XINUSE],lat_min,lat_max,lat_avg,
+			lat_tail_avg[lat_avg(97%)]
+			from $table_results;"
+}
+
+# Get the count of the combinations of event_name, RFBM, XINUSE among all lat trace records
+function get_combs_cnt() {
+	SQL_CMD "SELECT event_name, RFBM, XINUSE from $table_raw
+		group by event_name,RFBM,XINUSE;" | wc -l
+}
+
+# Get a specified combination from a table
+# ${1}: name of table
+# ${2}: the index of the (event_name, RFBM, XINUSE) combination
+# ${3}: the columns to be shown
+function get_comb_item() {
+	table=$1
+	cnt=$2
+	col=$3
+	SQL_CMD "SELECT $col from $table group by event_name,RFBM,XINUSE limit $cnt,1;"
+}
+
+# Get count of the records in a given table
+# ${1}: name of the table
+function get_rows_cnt() {
+	table=$1
+	SQL_CMD "SELECT count(*) from $table;"
+}
+
+# Generate a new table from the raw trace table removing 3% tail traces.
+function gen_tail_lat() {
+	cnt=`get_combs_cnt`
+	create_table $table_tail
+
+	for((i=0; i<$cnt; i++));do
+		create_table t$i
+		event_name=`get_comb_item $table_raw $i event_name`
+		RFBM=`get_comb_item $table_raw $i RFBM`
+		XINUSE=`get_comb_item $table_raw $i XINUSE`
+
+		SQL_CMD "insert into t$i(process,cpu,timestamp,event_name,lat,RFBM,XINUSE)
+			select process,cpu,timestamp,event_name,lat,RFBM,XINUSE
+			from $table_raw where event_name=\"$event_name\" and RFBM=$RFBM and
+			XINUSE=$XINUSE ORDER BY lat ASC;"
+
+		row=`get_rows_cnt t$i`
+		row=`echo "scale=0; ($row*0.97 + 0.5)/1" | bc`
+
+		SQL_CMD "insert into $table_tail
+			(process,cpu,timestamp,event_name,lat,RFBM,XINUSE)
+			select process,cpu,timestamp,event_name,lat,RFBM,XINUSE
+			from t$i limit 0,$row;"
+	done
+
+}
+
+if [[ ! -e "$trace_log" || $# != 1 ]];then
+	usage
+	exit 1
+fi
+
+# Check dependency
+# Make sure the following packages are installed
+check_packages sqlite3 bc wc cut
+
+# Filter trace log keeping latency related lines only
+grep -E "$events" $trace_log > $trace_lat_log
+cnt_lines=`wc -l $trace_lat_log | cut -d' ' -f1`
+# Remove the old db file if it existed before creating
+[[ -f $db_file ]] && rm -rf $db_file
+
+create_table $table_raw
+
+# Read each line from the temp file and insert into the table
+i=0
+while IFS= read -r line;
+do
+	((i = i + 1))
+	echo -ne "(${i}/$cnt_lines) Importing trace log into database!\r"
+	if [[ "$line" =~ ${regex1}${regex2}${regex3} ]]; then
+		pname=${BASH_REMATCH[1]}
+		cpu=${BASH_REMATCH[2]}
+		ts=${BASH_REMATCH[4]}
+		ename=${BASH_REMATCH[5]}
+		lat=${BASH_REMATCH[6]}
+		((rfbm=${BASH_REMATCH[7]}))
+		((xinuse=${BASH_REMATCH[8]}))
+
+		insert_line $table_raw $pname $cpu $ts $ename $lat $rfbm $xinuse
+	fi
+done < $trace_lat_log
+
+gen_tail_lat
+get_latency_stat
+
+# Cleanup
+rm -rf $trace_lat_log $db_file
-- 
2.34.1



* RE: [PATCH 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR
  2022-07-23  8:37 ` [PATCH 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR Yi Sun
@ 2022-07-24 20:54   ` David Laight
  2022-07-24 23:00     ` Luck, Tony
                       ` (2 more replies)
  2022-07-28 18:46   ` Sohil Mehta
  1 sibling, 3 replies; 10+ messages in thread
From: David Laight @ 2022-07-24 20:54 UTC (permalink / raw)
  To: 'Yi Sun', dave.hansen, linux-kernel, x86
  Cc: sohil.mehta, tony.luck, heng.su

From: Yi Sun
> Sent: 23 July 2022 09:38
> 
> Calculate the latency of instructions xsave and xrstor with new trace
> points x86_fpu_latency_xsave and x86_fpu_latency_xrstor.
> 
> The delta TSC can be calculated within a single trace event. Another
> option considered was to have 2 separated trace events marking the
> start and finish of the xsave/xrstor instructions. The delta TSC was
> calculated from the 2 trace points in user space, but there was
> significant overhead added by the trace function itself.
> 
> In internal testing, the single trace point option which is
> implemented here proved to be more accurate.
...

I've done some experiments that measure short instruction latencies.
Basically I found:
1) You need a suitable serialising instruction before and after
   the code being tested - otherwise it can overlap whatever
   you are using for timing.
2) The only reliable counter is the performance monitor clock
   counter - everything else depends on the current cpu frequency.
   On intel cpu the cpu frequency can change all the time.
Allowing for that, and then ignoring complete outliers, I could
get clock-count accurate values for iterations of the IP csum loop.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* Re: [PATCH 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR
  2022-07-24 20:54   ` David Laight
@ 2022-07-24 23:00     ` Luck, Tony
  2022-07-25  7:35       ` David Laight
  2022-07-25 17:44     ` Dave Hansen
  2022-07-26  9:03     ` 'Yi Sun'
  2 siblings, 1 reply; 10+ messages in thread
From: Luck, Tony @ 2022-07-24 23:00 UTC (permalink / raw)
  To: David Laight
  Cc: Sun, Yi, Hansen, Dave, linux-kernel, x86, Mehta, Sohil, Su, Heng

RDTSC has returned values invariant of current frequency since Nehalem (modulo a few hiccoughs). So any CPU with XSAVE/XRESTOR should be safe to measure using TSC.

Sent from my iPhone

> On Jul 24, 2022, at 14:16, David Laight <David.Laight@aculab.com> wrote:
> 
> From: Yi Sun
>> Sent: 23 July 2022 09:38
>> 
>> Calculate the latency of instructions xsave and xrstor with new trace
>> points x86_fpu_latency_xsave and x86_fpu_latency_xrstor.
>> 
>> The delta TSC can be calculated within a single trace event. Another
>> option considered was to have 2 separated trace events marking the
>> start and finish of the xsave/xrstor instructions. The delta TSC was
>> calculated from the 2 trace points in user space, but there was
>> significant overhead added by the trace function itself.
>> 
>> In internal testing, the single trace point option which is
>> implemented here proved to be more accurate.
> ...
> 
> I've done some experiments that measure short instruction latencies.
> Basically I found:
> 1) You need a suitable serialising instruction before and after
>   the code being tested - otherwise it can overlap whatever
>   you are using for timing.
> 2) The only reliable counter is the performance monitor clock
>   counter - everything else depends on the current cpu frequency.
>   On intel cpu the cpu frequency can change all the time.
> Allowing for that, and then ignoring complete outliers, I could
> get clock-count accurate values for iterations of the IP csum loop.
> 
>    David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
> 


* RE: [PATCH 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR
  2022-07-24 23:00     ` Luck, Tony
@ 2022-07-25  7:35       ` David Laight
  0 siblings, 0 replies; 10+ messages in thread
From: David Laight @ 2022-07-25  7:35 UTC (permalink / raw)
  To: 'Luck, Tony'
  Cc: Sun, Yi, Hansen, Dave, linux-kernel, x86, Mehta, Sohil, Su, Heng

From: Luck, Tony
> Sent: 25 July 2022 00:00
> 
> RDTSC has returned values invariant of current frequency since Nehalem (modulo a few hiccoughs). So
> any CPU with XSAVE/XRESTOR should be safe to measure using TSC.

Indeed - that it exactly why you can't use the TSC to measure
instruction latency any more.
You need to measure latency in clocks, not time.

On cpu where all the cores run at the same frequency you can
see the effect by spinning one core in userspace.
Running 'while :; do :; done' from a shell prompt is pretty
effective at spinning in userspace.

	David

> 
> Sent from my iPhone
> 
> > On Jul 24, 2022, at 14:16, David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Yi Sun
> >> Sent: 23 July 2022 09:38
> >>
> >> Calculate the latency of instructions xsave and xrstor with new trace
> >> points x86_fpu_latency_xsave and x86_fpu_latency_xrstor.
> >>
> >> The delta TSC can be calculated within a single trace event. Another
> >> option considered was to have 2 separated trace events marking the
> >> start and finish of the xsave/xrstor instructions. The delta TSC was
> >> calculated from the 2 trace points in user space, but there was
> >> significant overhead added by the trace function itself.
> >>
> >> In internal testing, the single trace point option which is
> >> implemented here proved to be more accurate.
> > ...
> >
> > I've done some experiments that measure short instruction latencies.
> > Basically I found:
> > 1) You need a suitable serialising instruction before and after
> >   the code being tested - otherwise it can overlap whatever
> >   you are using for timing.
> > 2) The only reliable counter is the performance monitor clock
> >   counter - everything else depends on the current cpu frequency.
> >   On intel cpu the cpu frequency can change all the time.
> > Allowing for that, and then ignoring complete outliers, I could
> > get clock-count accurate values for iterations of the IP csum loop.
> >
> >    David
> >
> > -
> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > Registration No: 1397386 (Wales)
> >

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR
  2022-07-24 20:54   ` David Laight
  2022-07-24 23:00     ` Luck, Tony
@ 2022-07-25 17:44     ` Dave Hansen
  2022-07-26  7:41       ` David Laight
  2022-07-26  9:03     ` 'Yi Sun'
  2 siblings, 1 reply; 10+ messages in thread
From: Dave Hansen @ 2022-07-25 17:44 UTC (permalink / raw)
  To: David Laight, 'Yi Sun', linux-kernel, x86
  Cc: sohil.mehta, tony.luck, heng.su

On 7/24/22 13:54, David Laight wrote:
> I've done some experiments that measure short instruction latencies.
> Basically I found:

Short?  The instructions in question can write up to about 12k of data.
 That's not "short" by any means.

I'm also not sure precision here is all that important.  The main things
we want to know here are when and where the init and modified optimizations
are coming into play.  In other words, how often is there actual data
that *needs* to be saved and restored and can't be optimized away.

So, sure, if we were measuring a dozen cycles here, you could make an
argument that this _might_ be problematic.

But, in this case, we really just want to be able to tell when
XSAVE/XRSTOR are getting more or less expensive and also get out a
minimal amount of data (RFBM/XINUSE) to make a guess why that might be.

Is it *REALLY* worth throwing serializing instructions in and moving
clock sources to do that?  Is the added precision worth it?


* RE: [PATCH 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR
  2022-07-25 17:44     ` Dave Hansen
@ 2022-07-26  7:41       ` David Laight
  0 siblings, 0 replies; 10+ messages in thread
From: David Laight @ 2022-07-26  7:41 UTC (permalink / raw)
  To: 'Dave Hansen', 'Yi Sun', linux-kernel, x86
  Cc: sohil.mehta, tony.luck, heng.su

From: Dave Hansen
> Sent: 25 July 2022 18:44
> 
> On 7/24/22 13:54, David Laight wrote:
> > I've done some experiments that measure short instruction latencies.
> > Basically I found:
> 
> Short?  The instructions in question can write up to about 12k of data.
>  That's not "short" by any means.
> 
> I'm also not sure precision here is all that important.  The main things
> we want to know here are when and where the init and modified optimizations
> are coming into play.  In other words, how often is there actual data
> that *needs* to be saved and restored and can't be optimized away.
> 
> So, sure, if we were measuring a dozen cycles here, you could make an
> argument that this _might_ be problematic.
> 
> But, in this case, we really just want to be able to tell when
> XSAVE/XRSTOR are getting more or less expensive and also get out a
> minimal amount of data (RFBM/XINUSE) to make a guess why that might be.
> 
> Is it *REALLY* worth throwing serializing instructions in and moving
> clock sources to do that?  Is the added precision worth it?

I suspect that if you run your test in a loop the cpu will
be running at 800MHz for the first iteration but will soon
be running at 3GHz.
That is a 4-fold change in execution time if you use the TSC.

IIRC RDTSC is a serialising instruction (of some form), XSAVE
and XRSTOR may also be - so you probably are waiting for the
instruction to finish.
But you do actually need to ensure that is happening.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR
  2022-07-24 20:54   ` David Laight
  2022-07-24 23:00     ` Luck, Tony
  2022-07-25 17:44     ` Dave Hansen
@ 2022-07-26  9:03     ` 'Yi Sun'
  2 siblings, 0 replies; 10+ messages in thread
From: 'Yi Sun' @ 2022-07-26  9:03 UTC (permalink / raw)
  To: David Laight
  Cc: dave.hansen, linux-kernel, x86, sohil.mehta, tony.luck, heng.su

On 24.07.2022 20:54, David Laight wrote:
>From: Yi Sun
>> Sent: 23 July 2022 09:38
>>
>> Calculate the latency of instructions xsave and xrstor with new trace
>> points x86_fpu_latency_xsave and x86_fpu_latency_xrstor.
>>
>> The delta TSC can be calculated within a single trace event. Another
>> option considered was to have 2 separated trace events marking the
>> start and finish of the xsave/xrstor instructions. The delta TSC was
>> calculated from the 2 trace points in user space, but there was
>> significant overhead added by the trace function itself.
>>
>> In internal testing, the single trace point option which is
>> implemented here proved to be more accurate.
>...
>
>I've done some experiments that measure short instruction latencies.
>Basically I found:
>1) You need a suitable serialising instruction before and after
>   the code being tested - otherwise it can overlap whatever
>   you are using for timing.

The original comments here are admittedly not exact. The patch actually
makes use of trace_clock() to calculate the latency, not the TSC
directly. The precision here is "at most ~1 jiffy between CPUs", which
is probably acceptable for this usage.
I would like to refine the comments if they are confusing.

Thanks
   --Sun, Yi


* Re: [PATCH 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR
  2022-07-23  8:37 ` [PATCH 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR Yi Sun
  2022-07-24 20:54   ` David Laight
@ 2022-07-28 18:46   ` Sohil Mehta
  1 sibling, 0 replies; 10+ messages in thread
From: Sohil Mehta @ 2022-07-28 18:46 UTC (permalink / raw)
  To: Yi Sun; +Cc: tony.luck, heng.su, Dave Hansen, linux-kernel, x86

Some nits below:

On 7/23/2022 1:37 AM, Yi Sun wrote:
> Calculate the latency of instructions xsave and xrstor with new trace
> points x86_fpu_latency_xsave and x86_fpu_latency_xrstor.
> 
> The delta TSC can be calculated within a single trace event. 

s/can be/is

Another
> option considered was to have 2 separated trace events marking the
> start and finish of the xsave/xrstor instructions. The delta TSC was
> calculated from the 2 trace points in user space, but there was
> significant overhead added by the trace function itself.
> 

...

> 
> To enable it, CONFIG_X86_DEBUG_FPU and CONFIG_TRACEPOINTS are required.
> The compiler can get rid of all the extra crust when either of the two

s/can/will


> configs is disabled.
> 

> If both of the two configs are enabled, xsave/xrstor_tracing_enabled

s/two/

> would be reduced to a static check for tracing enabled. Thus, in the
> fast path there would be only 2 additional static checks.

...

> Leave the noise here instead of additional
> conditions before calling the x86_fpu_latency_* because that makes the
> latency more accurate and it's easy to filer the noise out by the

s/filer/filter

> following consuming script.

...

>   
> +DECLARE_EVENT_CLASS(x86_fpu_latency,
> +	TP_PROTO(struct fpstate *fpstate, u64 dtsc),

On x86, though TSC is the underlying counter for trace_clock(), it might 
be useful to keep this generic.

s/dtsc/latency

"latency" would match the printk below as well.

> +	TP_ARGS(fpstate, dtsc),
> +
> +	TP_STRUCT__entry(
> +		__field(struct fpstate *, fpstate)
> +		__field(u64, dtsc)
> +		__field(u64, rfbm)
> +		__field(u64, xinuse)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->fpstate = fpstate;
> +		__entry->dtsc = dtsc;
> +		__entry->rfbm = fpstate->xfeatures;
> +		__entry->xinuse = fpstate->regs.xsave.header.xfeatures;
> +	),
> +
> +	TP_printk("x86/fpu: latency:%lld RFBM:0x%llx XINUSE:0x%llx",
> +		__entry->dtsc,
> +		__entry->rfbm,
> +		__entry->xinuse
> +	)
> +);
> +

-Sohil


end of thread, other threads:[~2022-07-28 18:46 UTC | newest]

Thread overview: 10+ messages
2022-07-23  8:37 [PATCH 0/2] x86: Add xsave/xrstor Latency Trace Events and Consuming Script Yi Sun
2022-07-23  8:37 ` [PATCH 1/2] x86/fpu: Measure the Latency of XSAVE and XRSTOR Yi Sun
2022-07-24 20:54   ` David Laight
2022-07-24 23:00     ` Luck, Tony
2022-07-25  7:35       ` David Laight
2022-07-25 17:44     ` Dave Hansen
2022-07-26  7:41       ` David Laight
2022-07-26  9:03     ` 'Yi Sun'
2022-07-28 18:46   ` Sohil Mehta
2022-07-23  8:38 ` [PATCH 2/2] scripts/x86: Add script to consume trace log of xsave latency Yi Sun
