Re: [PATCH] cache: Workaround HiSilicon Taishan DC CVAU

From: chenweilong <chenweilong@huawei.com>
To: Will Deacon <will@kernel.org>
Cc: <catalin.marinas@arm.com>, <corbet@lwn.net>,
	<linux-kernel@vger.kernel.org>, <linux-doc@vger.kernel.org>
Subject: Re: [PATCH] cache: Workaround HiSilicon Taishan DC CVAU
Date: Wed, 29 Dec 2021 11:11:49 +0800	[thread overview]
Message-ID: <d8e62725-5597-7cc5-b862-db0cb5564bba@huawei.com> (raw)
In-Reply-To: <20211213185643.GA12717@willie-the-truck>

On 2021/12/14 2:56, Will Deacon wrote:
> On Fri, Nov 26, 2021 at 05:11:39PM +0800, Weilong Chen wrote:
>> Taishan's L1/L2 cache is inclusive, and the data is consistent.
>> Any change of L1 does not require DC operation to brush CL in L1 to L2.
>> It's safe that don't clean data cache by address to point of unification.
>>
>> Without IDC featrue, kernel needs to flush icache as well as dcache,
>> causes performance degradation.
>>
>> The flaw refers to V110/V200 variant 1.
>>
>> Signed-off-by: Weilong Chen <chenweilong@huawei.com>
>> ---
>>  Documentation/arm64/silicon-errata.rst |  2 ++
>>  arch/arm64/Kconfig                     | 11 +++++++++
>>  arch/arm64/include/asm/cputype.h       |  2 ++
>>  arch/arm64/kernel/cpu_errata.c         | 32 ++++++++++++++++++++++++++
>>  arch/arm64/tools/cpucaps               |  1 +
>>  5 files changed, 48 insertions(+)
> Hmm. We don't usually apply optimisations for specific CPUs on arm64, simply
> because the diversity of CPUs out there means it quickly becomes a
> fragmented mess.
>
> Is this patch purely a performance improvement? If so, please can you
> provide some numbers in an attempt to justify it?

Yes，it's a performance improvement. I have a test program like this:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/time.h>

int main()
{
        void *tmp;
        int len = 200 * 1024 * 1024;
        struct timeval start, end;
        int interval;
        tmp = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if(tmp == MAP_FAILED) {
                perror("mmap failed");
                exit(errno);
        }
        memset(tmp, 0, len);

        gettimeofday(&start, NULL);
        if(mprotect(tmp, len, PROT_READ|PROT_EXEC)) {
                perror("Couldn’t mprotect");
                exit(errno);
        }
        gettimeofday(&end, NULL);
        interval = 1000000*(end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
        printf("interval = %fms\n", interval/1000.0);
}

Without this fix, the mprotect takes:

interval = 25.608000ms

And with this fix:

interval = 0.689000ms

Have better performance improvement.

If you think it is suitable, I will send a v2 patch as the original patch broken cpu hotplug checks.

>
> Thanks,
>
> Will
> .