Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type

From: "Victor Kamensky" <kamensky@cisco.com>
To: Khem Raj <raj.khem@gmail.com>,
	Alexander Kanavin <alex.kanavin@gmail.com>
Cc: OE-core <openembedded-core@lists.openembedded.org>,
	Ross Burton <ross@burtonini.com>
Subject: Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type
Date: Thu, 8 Oct 2020 16:39:08 +0000	[thread overview]
Message-ID: <BYAPR11MB30476A66D1C6A0E1B1C3D6D6CD0B0@BYAPR11MB3047.namprd11.prod.outlook.com> (raw)
In-Reply-To: <CAMKF1srHSjgB2MGri=HqHfuzTUkzwPRqiLtoq6=-phDmRBH-UA@mail.gmail.com>

Hi Khem, Alexander,

Please response inline, look for 'kamensky>'

________________________________________
From: openembedded-core@lists.openembedded.org <openembedded-core@lists.openembedded.org> on behalf of Khem Raj <raj.khem@gmail.com>
Sent: Thursday, October 8, 2020 9:05 AM
To: Alexander Kanavin
Cc: Victor Kamensky (kamensky); OE-core; Ross Burton
Subject: Re: [OE-core] [PATCH 1/2] qemu: add 34Kf-64tlb fictitious cpu type

On Thu, Oct 8, 2020 at 4:53 AM Alexander Kanavin <alex.kanavin@gmail.com> wrote:
>
> Thanks - I note that Upstream-Status is missing, are you planning to approach qemu upstream with this?
>

Thinking about upstreaming, I think it might be worth proposing it upstream.

kamensky> Yes, the same was briefly discussed during today's big triage meeting:
kamensky> I will try to submit it to qemu upstream, and argue our case. it won't
kamensky> hurt to try it anyway.

kamensky> I as far as Upstream-Status concerned. Yes, it looks I've messed up,
kamensky> I've added 'Upstream Status' to OE patch. Did not realize that it should
kamensky> be added to the added patch itself.

Thanks,
Victor

> Alex
>
> On Thu, 8 Oct 2020 at 09:30, Ross Burton <ross@burtonini.com> wrote:
>>
>> Excellent work to identify a relatively simple way to dramatically
>> improve performance. Nice one!
>>
>> Ross
>>
>> On Wed, 7 Oct 2020 at 21:39, Victor Kamensky via
>> lists.openembedded.org <kamensky=cisco.com@lists.openembedded.org>
>> wrote:
>> >
>> > In Yocto Project PR 13992 it was reported that qemumips
>> > in autobuilder runs almost twice slower then qemumips64 and
>> > some times hit time out.
>> >
>> > Upon investigations of qemu-system with perf, gdb, and
>> > SystemTap and comparing qemumips and qemumips64 machines
>> > behavior it was noticed that qemu soft mmu code behaves
>> > quite different and in case if qemumips tlbwr instruction
>> > called 16 times more oftern. It happens that in qemumips64
>> > case qemu runs with cpu type that contains 64 TLB, but in case
>> > of qemumips qemu runs with cpu type that contains only
>> > 16 TLBs.
>> >
>> > The idea of proposed qemu patch is to introduce fictitious
>> > 34Kf-64tlb cpu type that defined exactly as 34Kf but has
>> > 64 TLBs, instead of original 16 TLBs.
>> >
>> > Testing of core-image-full-cmdline:do_testimage with
>> > 34Kf-64tlb shows 40% or so test execution real time
>> > improvement.
>> >
>> > Note for future porters of the patch: easiest way to update
>> > the patch and be in sync with 34Kf definition is to copy
>> > 34Kf machine definition and apply the following changes to
>> > it (just change 15 to 63 of CP0C1_MMU bits value)
>> >
>> > [kamensky@coreos-lnx2 qemu]$ diff ~/34Kf.c ~/34Kf-64tlb.c
>> > 2c2
>> > <         .name = "34Kf",
>> > >         .name = "34Kf-64tlb",
>> > 6c6
>> > <         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (15 << CP0C1_MMU) |
>> > >         .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
>> >
>> > Fixes https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
>> >
>> > Upstream Status: Inappropriate
>> >
>> > Signed-off-by: Victor Kamensky <kamensky@cisco.com>
>> > ---
>> >  meta/recipes-devtools/qemu/qemu.inc                |   1 +
>> >  ...Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch | 118 +++++++++++++++++++++
>> >  2 files changed, 119 insertions(+)
>> >  create mode 100644 meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
>> >
>> > diff --git a/meta/recipes-devtools/qemu/qemu.inc b/meta/recipes-devtools/qemu/qemu.inc
>> > index bbb9038961..6c0edcb706 100644
>> > --- a/meta/recipes-devtools/qemu/qemu.inc
>> > +++ b/meta/recipes-devtools/qemu/qemu.inc
>> > @@ -31,6 +31,7 @@ SRC_URI = "https://download.qemu.org/${BPN}-${PV}.tar.xz \
>> >             file://0001-qemu-Do-not-include-file-if-not-exists.patch \
>> >             file://find_datadir.patch \
>> >             file://usb-fix-setup_len-init.patch \
>> > +           file://0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch \
>> >             "
>> >  UPSTREAM_CHECK_REGEX = "qemu-(?P<pver>\d+(\.\d+)+)\.tar"
>> >
>> > diff --git a/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
>> > new file mode 100644
>> > index 0000000000..b6312e1543
>> > --- /dev/null
>> > +++ b/meta/recipes-devtools/qemu/qemu/0001-mips-add-34Kf-64tlb-fictitious-cpu-type-like-34Kf-bu.patch
>> > @@ -0,0 +1,118 @@
>> > +From b3fcc7d96523ad8e3ea28c09d495ef08529d01ce Mon Sep 17 00:00:00 2001
>> > +From: Victor Kamensky <kamensky@cisco.com>
>> > +Date: Wed, 7 Oct 2020 10:19:42 -0700
>> > +Subject: [PATCH] mips: add 34Kf-64tlb fictitious cpu type like 34Kf but with
>> > + 64 TLBs
>> > +
>> > +In Yocto Project CI runs it was observed that test run
>> > +of 32 bit mips image takes almost twice longer than 64 bit
>> > +mips image with the same logical load and CI execution
>> > +hits timeout.
>> > +
>> > +See https://bugzilla.yoctoproject.org/show_bug.cgi?id=13992
>> > +
>> > +Yocto project uses 34Kf cpu type to run 32 bit mips image,
>> > +and MIPS64R2-generic cpu type to run 64 bit mips64 image.
>> > +
>> > +Upon qemu behavior differences investigation between mips
>> > +and mips64 two prominent observations came up: under
>> > +logically similar load (same definition and configuration
>> > +of user-land image) in case of mips get_physical_address
>> > +function is called almost twice more often, meaning
>> > +twice more memory accesses involved in this case. Also
>> > +number of tlbwr instruction executed (r4k_helper_tlbwr
>> > +qemu function) almost 16 time bigger in mips case than in
>> > +mips64.
>> > +
>> > +It turns out that 34Kf cpu has 16 TLBs, but in case of
>> > +MIPS64R2-generic it is 64 TLBs. So that explains why
>> > +some many more tlbwr had to be execute by kernel TLB refill
>> > +handler in case of 32 bit misp.
>> > +
>> > +The idea of the fix is to come up with new 34Kf-64tlb fictitious
>> > +cpu type, that would behave exactly as 34Kf but it would
>> > +contain 64 TLBs to reduce TLB trashing. After all, adding
>> > +more TLBs to soft mmu is easy.
>> > +
>> > +Experiment with some significant non-trvial load in Yocto
>> > +environment by running do_testimage load shows that 34Kf-64tlb
>> > +cpu performs 40% or so better than original 34Kf cpu wrt test
>> > +execution real time.
>> > +
>> > +It is not ideal to have cpu type that does not exist in the
>> > +wild but given performance gains it seems to be justified.
>> > +
>> > +Signed-off-by: Victor Kamensky <kamensky@cisco.com>
>> > +---
>> > + target/mips/translate_init.inc.c | 55 ++++++++++++++++++++++++++++++++++++++++
>> > + 1 file changed, 55 insertions(+)
>> > +
>> > +diff --git a/target/mips/translate_init.inc.c b/target/mips/translate_init.inc.c
>> > +index 637caccd89..b73ab48231 100644
>> > +--- a/target/mips/translate_init.inc.c
>> > ++++ b/target/mips/translate_init.inc.c
>> > +@@ -297,6 +297,61 @@ const mips_def_t mips_defs[] =
>> > +         .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
>> > +         .mmu_type = MMU_TYPE_R4000,
>> > +     },
>> > ++    /*
>> > ++     * Verbatim copy of "34Kf" cpu, only bumped up number of TLB entries
>> > ++     * from 16 to 64 (see CP0_Config0 value at CP0C1_MMU bits) to improve
>> > ++     * performance by reducing number of TLB refill exceptions and
>> > ++     * eliminating need to run all corresponding TLB refill handling
>> > ++     * instructions.
>> > ++     */
>> > ++    {
>> > ++        .name = "34Kf-64tlb",
>> > ++        .CP0_PRid = 0x00019500,
>> > ++        .CP0_Config0 = MIPS_CONFIG0 | (0x1 << CP0C0_AR) |
>> > ++                       (MMU_TYPE_R4000 << CP0C0_MT),
>> > ++        .CP0_Config1 = MIPS_CONFIG1 | (1 << CP0C1_FP) | (63 << CP0C1_MMU) |
>> > ++                       (0 << CP0C1_IS) | (3 << CP0C1_IL) | (1 << CP0C1_IA) |
>> > ++                       (0 << CP0C1_DS) | (3 << CP0C1_DL) | (1 << CP0C1_DA) |
>> > ++                       (1 << CP0C1_CA),
>> > ++        .CP0_Config2 = MIPS_CONFIG2,
>> > ++        .CP0_Config3 = MIPS_CONFIG3 | (1 << CP0C3_VInt) | (1 << CP0C3_MT) |
>> > ++                       (1 << CP0C3_DSPP),
>> > ++        .CP0_LLAddr_rw_bitmask = 0,
>> > ++        .CP0_LLAddr_shift = 0,
>> > ++        .SYNCI_Step = 32,
>> > ++        .CCRes = 2,
>> > ++        .CP0_Status_rw_bitmask = 0x3778FF1F,
>> > ++        .CP0_TCStatus_rw_bitmask = (0 << CP0TCSt_TCU3) | (0 << CP0TCSt_TCU2) |
>> > ++                    (1 << CP0TCSt_TCU1) | (1 << CP0TCSt_TCU0) |
>> > ++                    (0 << CP0TCSt_TMX) | (1 << CP0TCSt_DT) |
>> > ++                    (1 << CP0TCSt_DA) | (1 << CP0TCSt_A) |
>> > ++                    (0x3 << CP0TCSt_TKSU) | (1 << CP0TCSt_IXMT) |
>> > ++                    (0xff << CP0TCSt_TASID),
>> > ++        .CP1_fcr0 = (1 << FCR0_F64) | (1 << FCR0_L) | (1 << FCR0_W) |
>> > ++                    (1 << FCR0_D) | (1 << FCR0_S) | (0x95 << FCR0_PRID),
>> > ++        .CP1_fcr31 = 0,
>> > ++        .CP1_fcr31_rw_bitmask = 0xFF83FFFF,
>> > ++        .CP0_SRSCtl = (0xf << CP0SRSCtl_HSS),
>> > ++        .CP0_SRSConf0_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf0 = (1U << CP0SRSC0_M) | (0x3fe << CP0SRSC0_SRS3) |
>> > ++                    (0x3fe << CP0SRSC0_SRS2) | (0x3fe << CP0SRSC0_SRS1),
>> > ++        .CP0_SRSConf1_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf1 = (1U << CP0SRSC1_M) | (0x3fe << CP0SRSC1_SRS6) |
>> > ++                    (0x3fe << CP0SRSC1_SRS5) | (0x3fe << CP0SRSC1_SRS4),
>> > ++        .CP0_SRSConf2_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf2 = (1U << CP0SRSC2_M) | (0x3fe << CP0SRSC2_SRS9) |
>> > ++                    (0x3fe << CP0SRSC2_SRS8) | (0x3fe << CP0SRSC2_SRS7),
>> > ++        .CP0_SRSConf3_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf3 = (1U << CP0SRSC3_M) | (0x3fe << CP0SRSC3_SRS12) |
>> > ++                    (0x3fe << CP0SRSC3_SRS11) | (0x3fe << CP0SRSC3_SRS10),
>> > ++        .CP0_SRSConf4_rw_bitmask = 0x3fffffff,
>> > ++        .CP0_SRSConf4 = (0x3fe << CP0SRSC4_SRS15) |
>> > ++                    (0x3fe << CP0SRSC4_SRS14) | (0x3fe << CP0SRSC4_SRS13),
>> > ++        .SEGBITS = 32,
>> > ++        .PABITS = 32,
>> > ++        .insn_flags = CPU_MIPS32R2 | ASE_MIPS16 | ASE_DSP | ASE_MT,
>> > ++        .mmu_type = MMU_TYPE_R4000,
>> > ++    },
>> > +     {
>> > +         .name = "74Kf",
>> > +         .CP0_PRid = 0x00019700,
>> > +--
>> > +2.14.5
>> > +
>> > --
>> > 2.14.5
>> >
>> >
>> >
>> >
>>
>>
>>
>
>
>