Re: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment

From: Waiman Long <longman@redhat.com>
To: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	"H. Peter Anvin" <hpa@zytor.com>, Feng Tang <feng.tang@intel.com>,
	Bill Gray <bgray@redhat.com>, Jirka Hladky <jhladky@redhat.com>
Subject: Re: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment
Date: Tue, 26 Apr 2022 11:36:07 -0400
Message-ID: <68837b1a-f85b-e842-f8c0-1cad162856f4@redhat.com>
In-Reply-To: <87czh50xwf.ffs@tglx>

On 4/25/22 15:24, Thomas Gleixner wrote:
> On Mon, Apr 25 2022 at 09:20, Waiman Long wrote:
>> On 4/22/22 06:41, Thomas Gleixner wrote:
>>> I did some experiments and noticed that the boot time overhead is
>>> different from the overhead when doing the sync check after boot
>>> (offline a socket and on/offline the first CPU of it several times).
>>>
>>> During boot the overhead is lower on this machine (SKL-X), during
>>> runtime it's way higher and more noisy.
>>>
>>> The noise can be pretty much eliminated by running the sync_overhead
>>> measurement multiple times and building the average.
>>>
>>> The reason why it is higher is that after offlining the socket the CPU
>>> comes back up with a frequency of 700MHz while during boot it runs with
>>> 2100MHz.
>>>
>>> Sync overhead: 118
>>> Sync overhead:  51 A: 22466 M: 22448 F: 2101683
>> One explanation of the sync overhead difference (118 vs 51) here is
>> whether the lock cacheline is local or remote. My analysis of the
>> interaction between check_tsc_sync_source() and check_tsc_sync_target()
>> is that the real overhead comes from locking with a remote cacheline
>> (local to source, remote to target). When you do a loop of 256 lock
>> operations, it is all local cacheline. That is why the overhead is
>> lower. It also depends on whether the remote cacheline is in the same
>> socket or a different socket.
> Yes. It's clear that the initial sync overhead is due to the cache line
> being remote, but I'd rather underestimate the compensation. Aside from
> that, it's not guaranteed that the cache line is actually remote on the
> first access. It's by chance, but not by design.

In check_tsc_warp(), the unlikely(prev > now) check can only trigger
and record a possible warp if last_tsc was previously written by
another cpu. That requires the transfer of the lock cacheline from the
remote cpu to the local cpu as well. So the sync overhead with a remote
cacheline is what really matters here. I had actually thought about
measuring just the local cacheline sync overhead so as to underestimate
it, and I am fine with doing that.
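
For reference, a minimal single-threaded sketch of the check being discussed. The struct and function names are made up for illustration and locking is left out entirely; in the kernel, last_tsc is updated under sync_lock, which is where the cacheline transfer cost comes from.

```c
#include <stdint.h>

/* Hypothetical stand-in for the shared state that sync_lock protects. */
struct warp_state {
	uint64_t last_tsc;	/* last timestamp written by either CPU */
	uint64_t max_warp;	/* largest backwards step seen so far */
};

/*
 * Record one timestamp sample and return the observed warp (0 if none).
 * A warp can only show up when the previous writer of last_tsc stored a
 * value larger than the current reading, i.e. when prev > now.
 */
static uint64_t record_tsc_sample(struct warp_state *ws, uint64_t now)
{
	uint64_t prev = ws->last_tsc;
	uint64_t warp = 0;

	ws->last_tsc = now;
	if (prev > now) {
		warp = prev - now;
		if (warp > ws->max_warp)
			ws->max_warp = warp;
	}
	return warp;
}
```

Feeding it the samples 100 and then 90 reports a warp of 10, while a monotonically increasing sequence never reaches the prev > now branch.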

>>> Sync overhead: 178
>>> Sync overhead: 152 A: 22477 M: 67380 F:  700529
>>>
>>> Sync overhead: 212
>>> Sync overhead: 152 A: 22475 M: 67380 F:  700467
>>>
>>> Sync overhead: 153
>>> Sync overhead: 152 A: 22497 M: 67452 F:  700404
>>>
>>> Can you try the patch below and check whether the overhead stabilizes
>>> across several attempts on that copperlake machine and whether the
>>> frequency is always the same or varies?
>> Yes, I will try that experiment and report back the results.
>>> Independent of the outcome on that, I think we have to take the actual
>>> CPU frequency into account for calculating the overhead.
>> Assuming that the clock frequency remains the same during the
>> check_tsc_warp() loop and the sync overhead computation time, I don't
>> think the actual clock frequency matters much. However, it will be a
>> different matter if the frequency does change. In this case, it is more
>> likely the frequency will go up than down. Right? IOW, we may
>> underestimate the sync overhead in this case. I think it is better than
>> overestimating it.
> The question is not whether the clock frequency changes during the loop.
> The point is:
>
>      start = rdtsc();
>      do_stuff();
>      end = rdtsc();
>      compensation = end - start;
>      
> do_stuff() executes a constant number of instructions which are executed
> in a constant number of CPU clock cycles, let's say 100 for simplicity.
> TSC runs with 2000MHz.
>
> With a CPU frequency of 1000 MHz the real computation time is:
>
>     100/1000MHz = 100 nsec = 200 TSC cycles
>
> while with a CPU frequency of 2000MHz it is obviously:
>
>     100/2000MHz =  50 nsec = 100 TSC cycles
>
> IOW, TSC runs with a constant frequency independent of the actual CPU
> frequency, ergo the CPU frequency dependent execution time has an
> influence on the resulting compensation value, no?
>
> On the machine I tested on, it's a factor of 3 between the minimal and
> the maximal CPU frequency, which makes quite a difference, right?

Yes, I understand that. The sync_overhead measurement estimates the
delay (in TSC cycles) that the locking overhead introduces. At a CPU
frequency of 1000MHz, the delay in TSC cycles will be double that of a
cpu running at 2000MHz, so more compensation is needed in that case.
That is why I said that as long as the clock frequency doesn't change
between the check_tsc_warp() loop and the sync_overhead measurement
part of the code, the actual cpu frequency does not matter here.
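
The arithmetic from the quoted example can be written down as a small helper. This is only a sketch: the function name and the MHz parameters are assumptions for illustration and do not come from the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a fixed number of CPU core cycles into
 * TSC cycles, given the core and TSC frequencies in MHz.
 *
 * elapsed time = cpu_cycles / cpu_mhz
 * TSC cycles   = elapsed time * tsc_mhz
 */
static uint64_t cpu_cycles_to_tsc_cycles(uint64_t cpu_cycles,
					 uint64_t cpu_mhz, uint64_t tsc_mhz)
{
	return cpu_cycles * tsc_mhz / cpu_mhz;
}
```

With a 2000MHz TSC, 100 core cycles come out as 200 TSC cycles at 1000MHz and as 100 TSC cycles at 2000MHz. Both the measurement and the delay it compensates for scale by the same factor, as long as the cpu frequency is the same for both.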

How about we halve the measured sync_overhead and use that as the
compensation? That avoids over-estimation, but probably increases the
chance that we need a second adjustment for TSC warp.
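
Roughly what I have in mind, combined with the averaging suggested earlier in the thread. This is a sketch only; the function name and the sample array are hypothetical and this is not the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: average several sync_overhead measurements to
 * reduce noise, then halve the result so that the compensation errs on
 * the side of underestimation.
 */
static uint64_t sync_compensation(const uint64_t *samples, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	if (n == 0)
		return 0;

	for (i = 0; i < n; i++)
		sum += samples[i];

	return (sum / n) / 2;
}
```

For four measurements of 152 TSC cycles each, this gives a compensation of 76.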

Cheers,
Longman