All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 02/14] sysbus: Remove sysbus_address_space
       [not found] ` <20220623102617.2164175-3-pdel@fb.com>
@ 2022-06-23 12:11   ` Peter Maydell
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Maydell @ 2022-06-23 12:11 UTC (permalink / raw)
  To: Peter Delevoryas
  Cc: clg, andrew, joel, pbonzini, berrange, eduardo, marcel.apfelbaum,
	richard.henderson, f4bug, ani, qemu-devel, qemu-arm, kvm

On Thu, 23 Jun 2022 at 11:56, Peter Delevoryas <pdel@fb.com> wrote:
>
> sysbus_address_space returns the address space associated with a
> SysBusDevice.
>
> That address space is always the global singleton "system_memory", which
> is retrieved through get_system_memory().
>
> This abstraction isn't very useful. Users of the sysbus API (machine
> authors) should know that sysbus_mmio_map et al. are mapping devices
> into the global singleton memory region, not into a specific container
> or some memory region specific to the device's parent bus.
>
> Lastly, only a few uses of this function exist. They can all be
> refactored to just use get_system_memory() directly.

Yeah, we definitely don't need two functions doing the same
thing here.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 04/14] sysbus: Add sysbus_mmio_map_in
       [not found] ` <20220623102617.2164175-5-pdel@fb.com>
@ 2022-06-23 12:15   ` Peter Maydell
  2022-06-23 18:29       ` Peter Delevoryas
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Maydell @ 2022-06-23 12:15 UTC (permalink / raw)
  To: Peter Delevoryas
  Cc: clg, andrew, joel, pbonzini, berrange, eduardo, marcel.apfelbaum,
	richard.henderson, f4bug, ani, qemu-devel, qemu-arm, kvm

On Thu, 23 Jun 2022 at 11:56, Peter Delevoryas <pdel@fb.com> wrote:
>
> Signed-off-by: Peter Delevoryas <pdel@fb.com>
> ---
>  hw/core/sysbus.c    | 6 ++++++
>  include/hw/sysbus.h | 2 ++
>  2 files changed, 8 insertions(+)
>
> diff --git a/hw/core/sysbus.c b/hw/core/sysbus.c
> index cb4d6bae9d..7b63ec3fed 100644
> --- a/hw/core/sysbus.c
> +++ b/hw/core/sysbus.c
> @@ -160,6 +160,12 @@ void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr)
>      sysbus_mmio_map_common(dev, n, addr, false, 0, get_system_memory());
>  }
>
> +void sysbus_mmio_map_in(SysBusDevice *dev, int n, hwaddr addr,
> +                        MemoryRegion *system_memory)
> +{
> +    sysbus_mmio_map_common(dev, n, addr, false, 0, system_memory);
> +}
> +
>  void sysbus_mmio_map_overlap(SysBusDevice *dev, int n, hwaddr addr,
>                               int priority)
>  {
> diff --git a/include/hw/sysbus.h b/include/hw/sysbus.h
> index a7c23d5fb1..f4578029e4 100644
> --- a/include/hw/sysbus.h
> +++ b/include/hw/sysbus.h
> @@ -80,6 +80,8 @@ void sysbus_connect_irq(SysBusDevice *dev, int n, qemu_irq irq);
>  bool sysbus_is_irq_connected(SysBusDevice *dev, int n);
>  qemu_irq sysbus_get_connected_irq(SysBusDevice *dev, int n);
>  void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr);
> +void sysbus_mmio_map_in(SysBusDevice *dev, int n, hwaddr addr,
> +                        MemoryRegion *system_memory);
>  void sysbus_mmio_map_overlap(SysBusDevice *dev, int n, hwaddr addr,
>                               int priority);

What's this going to be used for?

The current standard way to map a sysbus MMIO region into something
other than the global system memory region is to do:
   memory_region_add_subregion(&container, addr,
                               sysbus_mmio_get_region(sbd, 0));

I'd rather not have two ways to do the same thing; we have
far too many of those already.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 08/14] aspeed: Replace direct get_system_memory() calls
       [not found] ` <20220623102617.2164175-9-pdel@fb.com>
@ 2022-06-23 12:57   ` Peter Maydell
  2022-06-23 15:39     ` Cédric Le Goater
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Maydell @ 2022-06-23 12:57 UTC (permalink / raw)
  To: Peter Delevoryas
  Cc: clg, andrew, joel, pbonzini, berrange, eduardo, marcel.apfelbaum,
	richard.henderson, f4bug, ani, qemu-devel, qemu-arm, kvm

On Thu, 23 Jun 2022 at 13:37, Peter Delevoryas <pdel@fb.com> wrote:
>
> Note: sysbus_mmio_map(), sysbus_mmio_map_overlap(), and others are still
> using get_system_memory indirectly.
>
> Signed-off-by: Peter Delevoryas <pdel@fb.com>
> ---
>  hw/arm/aspeed.c         | 8 ++++----
>  hw/arm/aspeed_ast10x0.c | 5 ++---
>  hw/arm/aspeed_ast2600.c | 2 +-
>  hw/arm/aspeed_soc.c     | 6 +++---
>  4 files changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c
> index 8dae155183..3aa74e88fb 100644
> --- a/hw/arm/aspeed.c
> +++ b/hw/arm/aspeed.c
> @@ -371,7 +371,7 @@ static void aspeed_machine_init(MachineState *machine)
>                           amc->uart_default);
>      qdev_realize(DEVICE(&bmc->soc), NULL, &error_abort);
>
> -    memory_region_add_subregion(get_system_memory(),
> +    memory_region_add_subregion(bmc->soc.system_memory,
>                                  sc->memmap[ASPEED_DEV_SDRAM],
>                                  &bmc->ram_container);

This is board code, it shouldn't be reaching into the internals
of the SoC object like this. The board code probably already
has the right MemoryRegion because it was the one that passed
it to the SoC link porperty in the first place.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
       [not found] ` <20220623102617.2164175-13-pdel@fb.com>
@ 2022-06-23 15:09   ` Cédric Le Goater
  2022-06-23 18:43       ` Peter Delevoryas
  0 siblings, 1 reply; 19+ messages in thread
From: Cédric Le Goater @ 2022-06-23 15:09 UTC (permalink / raw)
  To: Peter Delevoryas
  Cc: peter.maydell, andrew, joel, pbonzini, berrange, eduardo,
	i.mitsyanko@gmail.com.mst, marcel.apfelbaum, richard.henderson,
	f4bug, ani, qemu-devel, qemu-arm, kvm

On 6/23/22 12:26, Peter Delevoryas wrote:
> Signed-off-by: Peter Delevoryas <pdel@fb.com>

Let's start simple without flash support. We should be able to
load FW blobs in each CPU address space using loader devices.

Thanks,

C.

> ---
>   hw/arm/aspeed.c             | 25 -------------------------
>   hw/arm/aspeed_soc.c         | 26 ++++++++++++++++++++++++++
>   include/hw/arm/aspeed_soc.h |  2 ++
>   3 files changed, 28 insertions(+), 25 deletions(-)
> 
> diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c
> index 3aa74e88fb..c893ce84d7 100644
> --- a/hw/arm/aspeed.c
> +++ b/hw/arm/aspeed.c
> @@ -278,31 +278,6 @@ static void write_boot_rom(DriveInfo *dinfo, hwaddr addr, size_t rom_size,
>       rom_add_blob_fixed("aspeed.boot_rom", storage, rom_size, addr);
>   }
>   
> -static void aspeed_board_init_flashes(AspeedSMCState *s, const char *flashtype,
> -                                      unsigned int count, int unit0)
> -{
> -    int i;
> -
> -    if (!flashtype) {
> -        return;
> -    }
> -
> -    for (i = 0; i < count; ++i) {
> -        DriveInfo *dinfo = drive_get(IF_MTD, 0, unit0 + i);
> -        qemu_irq cs_line;
> -        DeviceState *dev;
> -
> -        dev = qdev_new(flashtype);
> -        if (dinfo) {
> -            qdev_prop_set_drive(dev, "drive", blk_by_legacy_dinfo(dinfo));
> -        }
> -        qdev_realize_and_unref(dev, BUS(s->spi), &error_fatal);
> -
> -        cs_line = qdev_get_gpio_in_named(dev, SSI_GPIO_CS, 0);
> -        sysbus_connect_irq(SYS_BUS_DEVICE(s), i + 1, cs_line);
> -    }
> -}
> -
>   static void sdhci_attach_drive(SDHCIState *sdhci, DriveInfo *dinfo)
>   {
>           DeviceState *card;
> diff --git a/hw/arm/aspeed_soc.c b/hw/arm/aspeed_soc.c
> index b7e8506f28..33bfc06ed8 100644
> --- a/hw/arm/aspeed_soc.c
> +++ b/hw/arm/aspeed_soc.c
> @@ -20,6 +20,7 @@
>   #include "hw/i2c/aspeed_i2c.h"
>   #include "net/net.h"
>   #include "sysemu/sysemu.h"
> +#include "sysemu/blockdev.h"
>   
>   #define ASPEED_SOC_IOMEM_SIZE       0x00200000
>   
> @@ -579,3 +580,28 @@ void aspeed_soc_uart_init(AspeedSoCState *s)
>                          serial_hd(i), DEVICE_LITTLE_ENDIAN);
>       }
>   }
> +
> +void aspeed_board_init_flashes(AspeedSMCState *s, const char *flashtype,
> +                               unsigned int count, int unit0)
> +{
> +    int i;
> +
> +    if (!flashtype) {
> +        return;
> +    }
> +
> +    for (i = 0; i < count; ++i) {
> +        DriveInfo *dinfo = drive_get(IF_MTD, 0, unit0 + i);
> +        qemu_irq cs_line;
> +        DeviceState *dev;
> +
> +        dev = qdev_new(flashtype);
> +        if (dinfo) {
> +            qdev_prop_set_drive(dev, "drive", blk_by_legacy_dinfo(dinfo));
> +        }
> +        qdev_realize_and_unref(dev, BUS(s->spi), &error_fatal);
> +
> +        cs_line = qdev_get_gpio_in_named(dev, SSI_GPIO_CS, 0);
> +        sysbus_connect_irq(SYS_BUS_DEVICE(s), i + 1, cs_line);
> +    }
> +}
> diff --git a/include/hw/arm/aspeed_soc.h b/include/hw/arm/aspeed_soc.h
> index c68395ddbb..270d85d5de 100644
> --- a/include/hw/arm/aspeed_soc.h
> +++ b/include/hw/arm/aspeed_soc.h
> @@ -166,5 +166,7 @@ enum {
>   
>   qemu_irq aspeed_soc_get_irq(AspeedSoCState *s, int dev);
>   void aspeed_soc_uart_init(AspeedSoCState *s);
> +void aspeed_board_init_flashes(AspeedSMCState *s, const char *flashtype,
> +                               unsigned int count, int unit0);
>   
>   #endif /* ASPEED_SOC_H */


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 08/14] aspeed: Replace direct get_system_memory() calls
  2022-06-23 12:57   ` [PATCH 08/14] aspeed: Replace direct get_system_memory() calls Peter Maydell
@ 2022-06-23 15:39     ` Cédric Le Goater
  2022-06-23 18:45         ` Peter Delevoryas
  0 siblings, 1 reply; 19+ messages in thread
From: Cédric Le Goater @ 2022-06-23 15:39 UTC (permalink / raw)
  To: Peter Maydell, Peter Delevoryas
  Cc: andrew, joel, pbonzini, berrange, eduardo, marcel.apfelbaum,
	richard.henderson, f4bug, ani, qemu-devel, qemu-arm, kvm

On 6/23/22 14:57, Peter Maydell wrote:
> On Thu, 23 Jun 2022 at 13:37, Peter Delevoryas <pdel@fb.com> wrote:
>>
>> Note: sysbus_mmio_map(), sysbus_mmio_map_overlap(), and others are still
>> using get_system_memory indirectly.
>>
>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>> ---
>>   hw/arm/aspeed.c         | 8 ++++----
>>   hw/arm/aspeed_ast10x0.c | 5 ++---
>>   hw/arm/aspeed_ast2600.c | 2 +-
>>   hw/arm/aspeed_soc.c     | 6 +++---
>>   4 files changed, 10 insertions(+), 11 deletions(-)
>>
>> diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c
>> index 8dae155183..3aa74e88fb 100644
>> --- a/hw/arm/aspeed.c
>> +++ b/hw/arm/aspeed.c
>> @@ -371,7 +371,7 @@ static void aspeed_machine_init(MachineState *machine)
>>                            amc->uart_default);
>>       qdev_realize(DEVICE(&bmc->soc), NULL, &error_abort);
>>
>> -    memory_region_add_subregion(get_system_memory(),
>> +    memory_region_add_subregion(bmc->soc.system_memory,
>>                                   sc->memmap[ASPEED_DEV_SDRAM],
>>                                   &bmc->ram_container);
> 
> This is board code, it shouldn't be reaching into the internals
> of the SoC object like this. The board code probably already
> has the right MemoryRegion because it was the one that passed
> it to the SoC link porperty in the first place.

It's a bit messy currently. May be I got it wrong initially. The
board allocates a ram container region in which the machine ram
region is mapped. See commit ad1a9782186d ("aspeed: add a RAM memory
region container")

There is an extra region after ram in the ram container to catch
invalid access done by FW. That's how FW determines the size of
ram. See commit ebe31c0a8ef7 ("aspeed: add a max_ram_size property
to the memory controller")

But I think I can safely move all the RAM logic under the board.
I will send a patch.

Thanks,

C.

> 
> thanks
> -- PMM


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 04/14] sysbus: Add sysbus_mmio_map_in
  2022-06-23 12:15   ` [PATCH 04/14] sysbus: Add sysbus_mmio_map_in Peter Maydell
@ 2022-06-23 18:29       ` Peter Delevoryas
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Delevoryas @ 2022-06-23 18:29 UTC (permalink / raw)
  Cc: Peter Delevoryas, Cédric Le Goater, Andrew Jeffery,
	Joel Stanley, pbonzini, berrange, eduardo, marcel.apfelbaum,
	richard.henderson, Philippe Mathieu-Daudé,
	ani, Cameron Esfahani via, qemu-arm, kvm, Peter Maydell



> On Jun 23, 2022, at 5:15 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> On Thu, 23 Jun 2022 at 11:56, Peter Delevoryas <pdel@fb.com> wrote:
>> 
>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>> ---
>> hw/core/sysbus.c    | 6 ++++++
>> include/hw/sysbus.h | 2 ++
>> 2 files changed, 8 insertions(+)
>> 
>> diff --git a/hw/core/sysbus.c b/hw/core/sysbus.c
>> index cb4d6bae9d..7b63ec3fed 100644
>> --- a/hw/core/sysbus.c
>> +++ b/hw/core/sysbus.c
>> @@ -160,6 +160,12 @@ void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr)
>>     sysbus_mmio_map_common(dev, n, addr, false, 0, get_system_memory());
>> }
>> 
>> +void sysbus_mmio_map_in(SysBusDevice *dev, int n, hwaddr addr,
>> +                        MemoryRegion *system_memory)
>> +{
>> +    sysbus_mmio_map_common(dev, n, addr, false, 0, system_memory);
>> +}
>> +
>> void sysbus_mmio_map_overlap(SysBusDevice *dev, int n, hwaddr addr,
>>                              int priority)
>> {
>> diff --git a/include/hw/sysbus.h b/include/hw/sysbus.h
>> index a7c23d5fb1..f4578029e4 100644
>> --- a/include/hw/sysbus.h
>> +++ b/include/hw/sysbus.h
>> @@ -80,6 +80,8 @@ void sysbus_connect_irq(SysBusDevice *dev, int n, qemu_irq irq);
>> bool sysbus_is_irq_connected(SysBusDevice *dev, int n);
>> qemu_irq sysbus_get_connected_irq(SysBusDevice *dev, int n);
>> void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr);
>> +void sysbus_mmio_map_in(SysBusDevice *dev, int n, hwaddr addr,
>> +                        MemoryRegion *system_memory);
>> void sysbus_mmio_map_overlap(SysBusDevice *dev, int n, hwaddr addr,
>>                              int priority);
> 
> What's this going to be used for?

At the moment, just for SoC's to map peripherals into its memory region.

> 
> The current standard way to map a sysbus MMIO region into something
> other than the global system memory region is to do:
>   memory_region_add_subregion(&container, addr,
>                               sysbus_mmio_get_region(sbd, 0));

Oh!! Wait a minute. I should just change the Aspeed SoC code to
do that. I was kinda thinking that we wanted all the SoC code
to go through sysbus_mmio_map_common, but I realize now that
it’s really just a small helper that does some error checking,
and we can just implement that in an aspeed-specific helper if
necessary.

> 
> I'd rather not have two ways to do the same thing; we have
> far too many of those already.

Totally agree, I wasn’t very confident in this patch anyways.
I’ll remove it from the series and refactor the SoC board
patch following this to do what you suggested.

Thanks!!
Peter

> 
> thanks
> -- PMM


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 04/14] sysbus: Add sysbus_mmio_map_in
@ 2022-06-23 18:29       ` Peter Delevoryas
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Delevoryas @ 2022-06-23 18:29 UTC (permalink / raw)
  Cc: Peter Delevoryas, Cédric Le Goater, Andrew Jeffery,
	Joel Stanley, pbonzini, berrange, eduardo, marcel.apfelbaum,
	richard.henderson, Philippe Mathieu-Daudé,
	ani, Cameron Esfahani via, qemu-arm, kvm, Peter Maydell



> On Jun 23, 2022, at 5:15 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> On Thu, 23 Jun 2022 at 11:56, Peter Delevoryas <pdel@fb.com> wrote:
>> 
>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>> ---
>> hw/core/sysbus.c    | 6 ++++++
>> include/hw/sysbus.h | 2 ++
>> 2 files changed, 8 insertions(+)
>> 
>> diff --git a/hw/core/sysbus.c b/hw/core/sysbus.c
>> index cb4d6bae9d..7b63ec3fed 100644
>> --- a/hw/core/sysbus.c
>> +++ b/hw/core/sysbus.c
>> @@ -160,6 +160,12 @@ void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr)
>>     sysbus_mmio_map_common(dev, n, addr, false, 0, get_system_memory());
>> }
>> 
>> +void sysbus_mmio_map_in(SysBusDevice *dev, int n, hwaddr addr,
>> +                        MemoryRegion *system_memory)
>> +{
>> +    sysbus_mmio_map_common(dev, n, addr, false, 0, system_memory);
>> +}
>> +
>> void sysbus_mmio_map_overlap(SysBusDevice *dev, int n, hwaddr addr,
>>                              int priority)
>> {
>> diff --git a/include/hw/sysbus.h b/include/hw/sysbus.h
>> index a7c23d5fb1..f4578029e4 100644
>> --- a/include/hw/sysbus.h
>> +++ b/include/hw/sysbus.h
>> @@ -80,6 +80,8 @@ void sysbus_connect_irq(SysBusDevice *dev, int n, qemu_irq irq);
>> bool sysbus_is_irq_connected(SysBusDevice *dev, int n);
>> qemu_irq sysbus_get_connected_irq(SysBusDevice *dev, int n);
>> void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr);
>> +void sysbus_mmio_map_in(SysBusDevice *dev, int n, hwaddr addr,
>> +                        MemoryRegion *system_memory);
>> void sysbus_mmio_map_overlap(SysBusDevice *dev, int n, hwaddr addr,
>>                              int priority);
> 
> What's this going to be used for?

At the moment, just for SoC's to map peripherals into its memory region.

> 
> The current standard way to map a sysbus MMIO region into something
> other than the global system memory region is to do:
>   memory_region_add_subregion(&container, addr,
>                               sysbus_mmio_get_region(sbd, 0));

Oh!! Wait a minute. I should just change the Aspeed SoC code to
do that. I was kinda thinking that we wanted all the SoC code
to go through sysbus_mmio_map_common, but I realize now that
it’s really just a small helper that does some error checking,
and we can just implement that in an aspeed-specific helper if
necessary.

> 
> I'd rather not have two ways to do the same thing; we have
> far too many of those already.

Totally agree, I wasn’t very confident in this patch anyways.
I’ll remove it from the series and refactor the SoC board
patch following this to do what you suggested.

Thanks!!
Peter

> 
> thanks
> -- PMM


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
  2022-06-23 15:09   ` [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public Cédric Le Goater
@ 2022-06-23 18:43       ` Peter Delevoryas
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Delevoryas @ 2022-06-23 18:43 UTC (permalink / raw)
  Cc: Peter Delevoryas, Peter Maydell, Andrew Jeffery, Joel Stanley,
	pbonzini, berrange, eduardo, i.mitsyanko@gmail.com.mst,
	marcel.apfelbaum, richard.henderson, Philippe Mathieu-Daudé,
	ani, Cameron Esfahani via, qemu-arm, kvm, Cédric Le Goater



> On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
> 
> On 6/23/22 12:26, Peter Delevoryas wrote:
>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
> 
> Let's start simple without flash support. We should be able to
> load FW blobs in each CPU address space using loader devices.

Actually, I was unable to do this, perhaps because the fb OpenBMC
boot sequence is a little weird. I specifically _needed_ to have
a flash device which maps the firmware in at 0x2000_0000, because
the fb OpenBMC U-Boot SPL jumps to that address to start executing
from flash? I think this is also why fb OpenBMC machines can be so slow.

$ ./build/qemu-system-arm -machine fby35 \
    -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
    -d int -drive file=fby35.mtd,format=raw,if=mtd

U-Boot SPL 2019.04 fby35-e2294ff5d3 (Apr 15 2022 - 19:25:25 +0000)
SYS_INIT_RAM_END=10016000
CONFIG_SYS_INIT_SP_ADDR=10015000
CONFIG_MALLOC_F_ADDR=10012000
gd = sp = 10011f10
fdt=000182b4
Setup flash: write enable, addr4B, CE1 AHB 64MB window
Setup FMC_CE_CTRL = 0x00000033
Watchdog: 300s
hwstrap write protect SCU508=0x00000000, SCU518=0x00000000
Before: CE0_CTRL=0x00000600, CE1_CTRL=0x00000004
cs0_status = 1, cs1_status = 1
After: CE0_CTRL=0x00000400, CE1_CTRL=0x00000400
vboot_reset 504
SPL Could not find TPM (ret=-5)
Booting recovery U-Boot.
Blindly jumping to 0x20040000
QEMU 7.0.50 monitor - type 'help' for more information
(qemu) xp /x 0x20040000
0000000020040000: 0xea0000c0
(qemu) xp /x 0x00040000
0000000000040000: 0xea0000c0
(qemu) xp /x 0x00000000
0000000000000000: 0xea00001f
(qemu) xp /x 0x20000000
0000000020000000: 0xea00001f
(qemu) q
pdel@devvm9194:~/local/qemu ((79c196b...))
$ ./build/qemu-system-arm -machine fby35 -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic -d int

U-Boot SPL 2019.04 fby35-e2294ff5d3 (Apr 15 2022 - 19:25:25 +0000)
SYS_INIT_RAM_END=10016000
CONFIG_SYS_INIT_SP_ADDR=10015000
CONFIG_MALLOC_F_ADDR=10012000
gd = sp = 10011f10
fdt=000182b4
Setup flash: write enable, addr4B, CE1 AHB 64MB window
Setup FMC_CE_CTRL = 0x00000033
Watchdog: 300s
hwstrap write protect SCU508=0x00000000, SCU518=0x00000000
Before: CE0_CTRL=0x00000600, CE1_CTRL=0x00000004
cs0_status = 1, cs1_status = 1
After: CE0_CTRL=0x00000400, CE1_CTRL=0x00000400
vboot_reset 504
SPL Could not find TPM (ret=-5)
Booting recovery U-Boot.
Blindly jumping to 0x20040000
Taking exception 1 [Undefined Instruction] on CPU 0
...from EL3 to EL3
...with ESR 0x0/0x2000000
QEMU 7.0.50 monitor - type 'help' for more information
(qemu) xp /x 0x20000000
0000000020000000: 0xffffffff
(qemu) xp /x 0x20040000
0000000020040000: 0xffffffff

> 
> Thanks,
> 
> C.
> 
>> ---
>>  hw/arm/aspeed.c             | 25 -------------------------
>>  hw/arm/aspeed_soc.c         | 26 ++++++++++++++++++++++++++
>>  include/hw/arm/aspeed_soc.h |  2 ++
>>  3 files changed, 28 insertions(+), 25 deletions(-)
>> diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c
>> index 3aa74e88fb..c893ce84d7 100644
>> --- a/hw/arm/aspeed.c
>> +++ b/hw/arm/aspeed.c
>> @@ -278,31 +278,6 @@ static void write_boot_rom(DriveInfo *dinfo, hwaddr addr, size_t rom_size,
>>      rom_add_blob_fixed("aspeed.boot_rom", storage, rom_size, addr);
>>  }
>>  -static void aspeed_board_init_flashes(AspeedSMCState *s, const char *flashtype,
>> -                                      unsigned int count, int unit0)
>> -{
>> -    int i;
>> -
>> -    if (!flashtype) {
>> -        return;
>> -    }
>> -
>> -    for (i = 0; i < count; ++i) {
>> -        DriveInfo *dinfo = drive_get(IF_MTD, 0, unit0 + i);
>> -        qemu_irq cs_line;
>> -        DeviceState *dev;
>> -
>> -        dev = qdev_new(flashtype);
>> -        if (dinfo) {
>> -            qdev_prop_set_drive(dev, "drive", blk_by_legacy_dinfo(dinfo));
>> -        }
>> -        qdev_realize_and_unref(dev, BUS(s->spi), &error_fatal);
>> -
>> -        cs_line = qdev_get_gpio_in_named(dev, SSI_GPIO_CS, 0);
>> -        sysbus_connect_irq(SYS_BUS_DEVICE(s), i + 1, cs_line);
>> -    }
>> -}
>> -
>>  static void sdhci_attach_drive(SDHCIState *sdhci, DriveInfo *dinfo)
>>  {
>>          DeviceState *card;
>> diff --git a/hw/arm/aspeed_soc.c b/hw/arm/aspeed_soc.c
>> index b7e8506f28..33bfc06ed8 100644
>> --- a/hw/arm/aspeed_soc.c
>> +++ b/hw/arm/aspeed_soc.c
>> @@ -20,6 +20,7 @@
>>  #include "hw/i2c/aspeed_i2c.h"
>>  #include "net/net.h"
>>  #include "sysemu/sysemu.h"
>> +#include "sysemu/blockdev.h"
>>    #define ASPEED_SOC_IOMEM_SIZE       0x00200000
>>  @@ -579,3 +580,28 @@ void aspeed_soc_uart_init(AspeedSoCState *s)
>>                         serial_hd(i), DEVICE_LITTLE_ENDIAN);
>>      }
>>  }
>> +
>> +void aspeed_board_init_flashes(AspeedSMCState *s, const char *flashtype,
>> +                               unsigned int count, int unit0)
>> +{
>> +    int i;
>> +
>> +    if (!flashtype) {
>> +        return;
>> +    }
>> +
>> +    for (i = 0; i < count; ++i) {
>> +        DriveInfo *dinfo = drive_get(IF_MTD, 0, unit0 + i);
>> +        qemu_irq cs_line;
>> +        DeviceState *dev;
>> +
>> +        dev = qdev_new(flashtype);
>> +        if (dinfo) {
>> +            qdev_prop_set_drive(dev, "drive", blk_by_legacy_dinfo(dinfo));
>> +        }
>> +        qdev_realize_and_unref(dev, BUS(s->spi), &error_fatal);
>> +
>> +        cs_line = qdev_get_gpio_in_named(dev, SSI_GPIO_CS, 0);
>> +        sysbus_connect_irq(SYS_BUS_DEVICE(s), i + 1, cs_line);
>> +    }
>> +}
>> diff --git a/include/hw/arm/aspeed_soc.h b/include/hw/arm/aspeed_soc.h
>> index c68395ddbb..270d85d5de 100644
>> --- a/include/hw/arm/aspeed_soc.h
>> +++ b/include/hw/arm/aspeed_soc.h
>> @@ -166,5 +166,7 @@ enum {
>>    qemu_irq aspeed_soc_get_irq(AspeedSoCState *s, int dev);
>>  void aspeed_soc_uart_init(AspeedSoCState *s);
>> +void aspeed_board_init_flashes(AspeedSMCState *s, const char *flashtype,
>> +                               unsigned int count, int unit0);
>>    #endif /* ASPEED_SOC_H */
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
@ 2022-06-23 18:43       ` Peter Delevoryas
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Delevoryas @ 2022-06-23 18:43 UTC (permalink / raw)
  Cc: Peter Delevoryas, Peter Maydell, Andrew Jeffery, Joel Stanley,
	pbonzini, berrange, eduardo, i.mitsyanko@gmail.com.mst,
	marcel.apfelbaum, richard.henderson, Philippe Mathieu-Daudé,
	ani, Cameron Esfahani via, qemu-arm, kvm, Cédric Le Goater



> On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
> 
> On 6/23/22 12:26, Peter Delevoryas wrote:
>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
> 
> Let's start simple without flash support. We should be able to
> load FW blobs in each CPU address space using loader devices.

Actually, I was unable to do this, perhaps because the fb OpenBMC
boot sequence is a little weird. I specifically _needed_ to have
a flash device which maps the firmware in at 0x2000_0000, because
the fb OpenBMC U-Boot SPL jumps to that address to start executing
from flash? I think this is also why fb OpenBMC machines can be so slow.

$ ./build/qemu-system-arm -machine fby35 \
    -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
    -d int -drive file=fby35.mtd,format=raw,if=mtd

U-Boot SPL 2019.04 fby35-e2294ff5d3 (Apr 15 2022 - 19:25:25 +0000)
SYS_INIT_RAM_END=10016000
CONFIG_SYS_INIT_SP_ADDR=10015000
CONFIG_MALLOC_F_ADDR=10012000
gd = sp = 10011f10
fdt=000182b4
Setup flash: write enable, addr4B, CE1 AHB 64MB window
Setup FMC_CE_CTRL = 0x00000033
Watchdog: 300s
hwstrap write protect SCU508=0x00000000, SCU518=0x00000000
Before: CE0_CTRL=0x00000600, CE1_CTRL=0x00000004
cs0_status = 1, cs1_status = 1
After: CE0_CTRL=0x00000400, CE1_CTRL=0x00000400
vboot_reset 504
SPL Could not find TPM (ret=-5)
Booting recovery U-Boot.
Blindly jumping to 0x20040000
QEMU 7.0.50 monitor - type 'help' for more information
(qemu) xp /x 0x20040000
0000000020040000: 0xea0000c0
(qemu) xp /x 0x00040000
0000000000040000: 0xea0000c0
(qemu) xp /x 0x00000000
0000000000000000: 0xea00001f
(qemu) xp /x 0x20000000
0000000020000000: 0xea00001f
(qemu) q
pdel@devvm9194:~/local/qemu ((79c196b...))
$ ./build/qemu-system-arm -machine fby35 -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic -d int

U-Boot SPL 2019.04 fby35-e2294ff5d3 (Apr 15 2022 - 19:25:25 +0000)
SYS_INIT_RAM_END=10016000
CONFIG_SYS_INIT_SP_ADDR=10015000
CONFIG_MALLOC_F_ADDR=10012000
gd = sp = 10011f10
fdt=000182b4
Setup flash: write enable, addr4B, CE1 AHB 64MB window
Setup FMC_CE_CTRL = 0x00000033
Watchdog: 300s
hwstrap write protect SCU508=0x00000000, SCU518=0x00000000
Before: CE0_CTRL=0x00000600, CE1_CTRL=0x00000004
cs0_status = 1, cs1_status = 1
After: CE0_CTRL=0x00000400, CE1_CTRL=0x00000400
vboot_reset 504
SPL Could not find TPM (ret=-5)
Booting recovery U-Boot.
Blindly jumping to 0x20040000
Taking exception 1 [Undefined Instruction] on CPU 0
...from EL3 to EL3
...with ESR 0x0/0x2000000
QEMU 7.0.50 monitor - type 'help' for more information
(qemu) xp /x 0x20000000
0000000020000000: 0xffffffff
(qemu) xp /x 0x20040000
0000000020040000: 0xffffffff

> 
> Thanks,
> 
> C.
> 
>> ---
>>  hw/arm/aspeed.c             | 25 -------------------------
>>  hw/arm/aspeed_soc.c         | 26 ++++++++++++++++++++++++++
>>  include/hw/arm/aspeed_soc.h |  2 ++
>>  3 files changed, 28 insertions(+), 25 deletions(-)
>> diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c
>> index 3aa74e88fb..c893ce84d7 100644
>> --- a/hw/arm/aspeed.c
>> +++ b/hw/arm/aspeed.c
>> @@ -278,31 +278,6 @@ static void write_boot_rom(DriveInfo *dinfo, hwaddr addr, size_t rom_size,
>>      rom_add_blob_fixed("aspeed.boot_rom", storage, rom_size, addr);
>>  }
>>  -static void aspeed_board_init_flashes(AspeedSMCState *s, const char *flashtype,
>> -                                      unsigned int count, int unit0)
>> -{
>> -    int i;
>> -
>> -    if (!flashtype) {
>> -        return;
>> -    }
>> -
>> -    for (i = 0; i < count; ++i) {
>> -        DriveInfo *dinfo = drive_get(IF_MTD, 0, unit0 + i);
>> -        qemu_irq cs_line;
>> -        DeviceState *dev;
>> -
>> -        dev = qdev_new(flashtype);
>> -        if (dinfo) {
>> -            qdev_prop_set_drive(dev, "drive", blk_by_legacy_dinfo(dinfo));
>> -        }
>> -        qdev_realize_and_unref(dev, BUS(s->spi), &error_fatal);
>> -
>> -        cs_line = qdev_get_gpio_in_named(dev, SSI_GPIO_CS, 0);
>> -        sysbus_connect_irq(SYS_BUS_DEVICE(s), i + 1, cs_line);
>> -    }
>> -}
>> -
>>  static void sdhci_attach_drive(SDHCIState *sdhci, DriveInfo *dinfo)
>>  {
>>          DeviceState *card;
>> diff --git a/hw/arm/aspeed_soc.c b/hw/arm/aspeed_soc.c
>> index b7e8506f28..33bfc06ed8 100644
>> --- a/hw/arm/aspeed_soc.c
>> +++ b/hw/arm/aspeed_soc.c
>> @@ -20,6 +20,7 @@
>>  #include "hw/i2c/aspeed_i2c.h"
>>  #include "net/net.h"
>>  #include "sysemu/sysemu.h"
>> +#include "sysemu/blockdev.h"
>>    #define ASPEED_SOC_IOMEM_SIZE       0x00200000
>>  @@ -579,3 +580,28 @@ void aspeed_soc_uart_init(AspeedSoCState *s)
>>                         serial_hd(i), DEVICE_LITTLE_ENDIAN);
>>      }
>>  }
>> +
>> +void aspeed_board_init_flashes(AspeedSMCState *s, const char *flashtype,
>> +                               unsigned int count, int unit0)
>> +{
>> +    int i;
>> +
>> +    if (!flashtype) {
>> +        return;
>> +    }
>> +
>> +    for (i = 0; i < count; ++i) {
>> +        DriveInfo *dinfo = drive_get(IF_MTD, 0, unit0 + i);
>> +        qemu_irq cs_line;
>> +        DeviceState *dev;
>> +
>> +        dev = qdev_new(flashtype);
>> +        if (dinfo) {
>> +            qdev_prop_set_drive(dev, "drive", blk_by_legacy_dinfo(dinfo));
>> +        }
>> +        qdev_realize_and_unref(dev, BUS(s->spi), &error_fatal);
>> +
>> +        cs_line = qdev_get_gpio_in_named(dev, SSI_GPIO_CS, 0);
>> +        sysbus_connect_irq(SYS_BUS_DEVICE(s), i + 1, cs_line);
>> +    }
>> +}
>> diff --git a/include/hw/arm/aspeed_soc.h b/include/hw/arm/aspeed_soc.h
>> index c68395ddbb..270d85d5de 100644
>> --- a/include/hw/arm/aspeed_soc.h
>> +++ b/include/hw/arm/aspeed_soc.h
>> @@ -166,5 +166,7 @@ enum {
>>    qemu_irq aspeed_soc_get_irq(AspeedSoCState *s, int dev);
>>  void aspeed_soc_uart_init(AspeedSoCState *s);
>> +void aspeed_board_init_flashes(AspeedSMCState *s, const char *flashtype,
>> +                               unsigned int count, int unit0);
>>    #endif /* ASPEED_SOC_H */
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 08/14] aspeed: Replace direct get_system_memory() calls
  2022-06-23 15:39     ` Cédric Le Goater
@ 2022-06-23 18:45         ` Peter Delevoryas
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Delevoryas @ 2022-06-23 18:45 UTC (permalink / raw)
  Cc: Peter Delevoryas, Peter Maydell, Andrew Jeffery, Joel Stanley,
	pbonzini, berrange, eduardo, marcel.apfelbaum, richard.henderson,
	Philippe Mathieu-Daudé,
	ani, Cameron Esfahani via, qemu-arm, kvm, Cédric Le Goater



> On Jun 23, 2022, at 8:39 AM, Cédric Le Goater <clg@kaod.org> wrote:
> 
> On 6/23/22 14:57, Peter Maydell wrote:
>> On Thu, 23 Jun 2022 at 13:37, Peter Delevoryas <pdel@fb.com> wrote:
>>> 
>>> Note: sysbus_mmio_map(), sysbus_mmio_map_overlap(), and others are still
>>> using get_system_memory indirectly.
>>> 
>>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>>> ---
>>>  hw/arm/aspeed.c         | 8 ++++----
>>>  hw/arm/aspeed_ast10x0.c | 5 ++---
>>>  hw/arm/aspeed_ast2600.c | 2 +-
>>>  hw/arm/aspeed_soc.c     | 6 +++---
>>>  4 files changed, 10 insertions(+), 11 deletions(-)
>>> 
>>> diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c
>>> index 8dae155183..3aa74e88fb 100644
>>> --- a/hw/arm/aspeed.c
>>> +++ b/hw/arm/aspeed.c
>>> @@ -371,7 +371,7 @@ static void aspeed_machine_init(MachineState *machine)
>>>                           amc->uart_default);
>>>      qdev_realize(DEVICE(&bmc->soc), NULL, &error_abort);
>>> 
>>> -    memory_region_add_subregion(get_system_memory(),
>>> +    memory_region_add_subregion(bmc->soc.system_memory,
>>>                                  sc->memmap[ASPEED_DEV_SDRAM],
>>>                                  &bmc->ram_container);
>> This is board code, it shouldn't be reaching into the internals
>> of the SoC object like this. The board code probably already
>> has the right MemoryRegion because it was the one that passed
>> it to the SoC link porperty in the first place.
> 
> It's a bit messy currently. May be I got it wrong initially. The
> board allocates a ram container region in which the machine ram
> region is mapped. See commit ad1a9782186d ("aspeed: add a RAM memory
> region container")
> 
> There is an extra region after ram in the ram container to catch
> invalid access done by FW. That's how FW determines the size of
> ram. See commit ebe31c0a8ef7 ("aspeed: add a max_ram_size property
> to the memory controller")
> 
> But I think I can safely move all the RAM logic under the board.
> I will send a patch.

Ah yes, I noticed that this was odd too, and that the DRAM probably
should have been mapped inside the SoC code. I didn’t know exactly
how to solve it easily though. Thanks for sending a patch Cedric!!

> 
> Thanks,
> 
> C.
> 
>> thanks
>> -- PMM
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 08/14] aspeed: Replace direct get_system_memory() calls
@ 2022-06-23 18:45         ` Peter Delevoryas
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Delevoryas @ 2022-06-23 18:45 UTC (permalink / raw)
  Cc: Peter Delevoryas, Peter Maydell, Andrew Jeffery, Joel Stanley,
	pbonzini, berrange, eduardo, marcel.apfelbaum, richard.henderson,
	Philippe Mathieu-Daudé,
	ani, Cameron Esfahani via, qemu-arm, kvm, Cédric Le Goater



> On Jun 23, 2022, at 8:39 AM, Cédric Le Goater <clg@kaod.org> wrote:
> 
> On 6/23/22 14:57, Peter Maydell wrote:
>> On Thu, 23 Jun 2022 at 13:37, Peter Delevoryas <pdel@fb.com> wrote:
>>> 
>>> Note: sysbus_mmio_map(), sysbus_mmio_map_overlap(), and others are still
>>> using get_system_memory indirectly.
>>> 
>>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>>> ---
>>>  hw/arm/aspeed.c         | 8 ++++----
>>>  hw/arm/aspeed_ast10x0.c | 5 ++---
>>>  hw/arm/aspeed_ast2600.c | 2 +-
>>>  hw/arm/aspeed_soc.c     | 6 +++---
>>>  4 files changed, 10 insertions(+), 11 deletions(-)
>>> 
>>> diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c
>>> index 8dae155183..3aa74e88fb 100644
>>> --- a/hw/arm/aspeed.c
>>> +++ b/hw/arm/aspeed.c
>>> @@ -371,7 +371,7 @@ static void aspeed_machine_init(MachineState *machine)
>>>                           amc->uart_default);
>>>      qdev_realize(DEVICE(&bmc->soc), NULL, &error_abort);
>>> 
>>> -    memory_region_add_subregion(get_system_memory(),
>>> +    memory_region_add_subregion(bmc->soc.system_memory,
>>>                                  sc->memmap[ASPEED_DEV_SDRAM],
>>>                                  &bmc->ram_container);
>> This is board code, it shouldn't be reaching into the internals
>> of the SoC object like this. The board code probably already
>> has the right MemoryRegion because it was the one that passed
>> it to the SoC link porperty in the first place.
> 
> It's a bit messy currently. May be I got it wrong initially. The
> board allocates a ram container region in which the machine ram
> region is mapped. See commit ad1a9782186d ("aspeed: add a RAM memory
> region container")
> 
> There is an extra region after ram in the ram container to catch
> invalid access done by FW. That's how FW determines the size of
> ram. See commit ebe31c0a8ef7 ("aspeed: add a max_ram_size property
> to the memory controller")
> 
> But I think I can safely move all the RAM logic under the board.
> I will send a patch.

Ah yes, I noticed that this was odd too, and that the DRAM probably
should have been mapped inside the SoC code. I didn’t know exactly
how to solve it easily though. Thanks for sending a patch Cedric!!

> 
> Thanks,
> 
> C.
> 
>> thanks
>> -- PMM
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
  2022-06-23 18:43       ` Peter Delevoryas
  (?)
@ 2022-06-24 16:50       ` Cédric Le Goater
  2022-06-29  9:11         ` Cédric Le Goater
  -1 siblings, 1 reply; 19+ messages in thread
From: Cédric Le Goater @ 2022-06-24 16:50 UTC (permalink / raw)
  To: Peter Delevoryas
  Cc: Peter Maydell, Andrew Jeffery, Joel Stanley, pbonzini, berrange,
	eduardo, marcel.apfelbaum, richard.henderson,
	Philippe Mathieu-Daudé,
	ani, Cameron Esfahani via, qemu-arm

On 6/23/22 20:43, Peter Delevoryas wrote:
> 
> 
>> On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
>>
>> On 6/23/22 12:26, Peter Delevoryas wrote:
>>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>>
>> Let's start simple without flash support. We should be able to
>> load FW blobs in each CPU address space using loader devices.
> 
> Actually, I was unable to do this, perhaps because the fb OpenBMC
> boot sequence is a little weird. I specifically _needed_ to have
> a flash device which maps the firmware in at 0x2000_0000, because
> the fb OpenBMC U-Boot SPL jumps to that address to start executing
> from flash? I think this is also why fb OpenBMC machines can be so slow.
> 
> $ ./build/qemu-system-arm -machine fby35 \
>      -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
>      -d int -drive file=fby35.mtd,format=raw,if=mtd



Ideally we should be booting from the flash device directly using
the machine option '-M ast2600-evb,execute-in-place=true' like HW
does. Instructions are fetched using SPI transfers. But the amount
of code generated is tremendous. See some profiling below for a
run which barely reaches DRAM training in U-Boot.

C.


* execute-in-place=true

Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds    calls  ns/call  ns/call  name
100.00      0.02     0.02   164276   121.75   121.75  memory_region_init_rom_device
   0.00      0.02     0.00 1610346008     0.00     0.00  tcg_code_capacity
   0.00      0.02     0.00 567612621     0.00     0.00  type_register_static_array
   0.00      0.02     0.00 328886191     0.00     0.00  do_common_semihosting
   0.00      0.02     0.00 297215811     0.00     0.00  container_get
   0.00      0.02     0.00 292670030     0.00     0.00  arm_cpu_tlb_fill
   0.00      0.02     0.00 195416119     0.00     0.00  arm_cpu_register_gdb_regs_for_features
   0.00      0.02     0.00 193326677     0.00     0.00  object_type_get_instance_size
   0.00      0.02     0.00 182365829     0.00     0.00  tcg_op_insert_after
   0.00      0.02     0.00 150668458     0.00     0.00  plugin_gen_tb_end
   0.00      0.02     0.00 142171940     0.00     0.00  gen_new_label
   0.00      0.02     0.00 133200628     0.00     0.00  smbios_build_type_38_table
   0.00      0.02     0.00 130540338     0.00     0.00  object_dynamic_cast_assert
   0.00      0.02     0.00 129223195     0.00     0.00  cpu_loop_exit_atomic
   0.00      0.02     0.00 121759298     0.00     0.00  tcg_remove_ops_after
   0.00      0.02     0.00 116887887     0.00     0.00  in_code_gen_buffer
   0.00      0.02     0.00 111803833     0.00     0.00  tcg_emit_op
   0.00      0.02     0.00 106052221     0.00     0.00  object_class_dynamic_cast_assert
   0.00      0.02     0.00 99704054     0.00     0.00  __jit_debug_register_code
   0.00      0.02     0.00 97812458     0.00     0.00  object_get_class
   0.00      0.02     0.00 88952594     0.00     0.00  tcg_splitwx_to_rx
   0.00      0.02     0.00 85790920     0.00     0.00  object_class_dynamic_cast
   0.00      0.02     0.00 73780673     0.00     0.00  helper_exit_atomic
   0.00      0.02     0.00 65337482     0.00     0.00  tcg_op_supported
   0.00      0.02     0.00 61213619     0.00     0.00  tcg_func_start
   0.00      0.02     0.00 54477684     0.00     0.00  tcg_flush_softmmu_tlb
   0.00      0.02     0.00 53968980     0.00     0.00  tcg_temp_new_internal
   0.00      0.02     0.00 51526008     0.00     0.00  qemu_in_vcpu_thread
   0.00      0.02     0.00 40750952     0.00     0.00  pflash_cfi02_register
   0.00      0.02     0.00 38039442     0.00     0.00  tcg_gen_op2
   0.00      0.02     0.00 37068039     0.00     0.00  tcg_gen_op1
   0.00      0.02     0.00 36473276     0.00     0.00  tcg_gen_op3
   0.00      0.02     0.00 36310225     0.00     0.00  gen_gvec_uaba
   0.00      0.02     0.00 30985436     0.00     0.00  tb_set_jmp_target
   0.00      0.02     0.00 30291796     0.00     0.00  tcg_constant_internal
   0.00      0.02     0.00 29857950     0.00     0.00  ssi_transfer

* execute-in-place=false

Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds    calls  ns/call  ns/call  name
  40.00      0.02     0.02   551149    36.29    36.29  aspeed_board_init_flashes
  20.00      0.03     0.01  3937238     2.54     2.54  register_cp_regs_for_features
  20.00      0.04     0.01   674096    14.83    14.83  gen_gvec_uaba
  20.00      0.05     0.01   457461    21.86    21.86  finalize_target_page_bits
   0.00      0.05     0.00  5364258     0.00     0.00  arm_gt_hvtimer_cb
   0.00      0.05     0.00  2467532     0.00     0.00  helper_neon_narrow_sat_s8
   0.00      0.05     0.00  2431860     0.00     0.00  opb_opb2fsi_address
   0.00      0.05     0.00  1828453     0.00     0.00  cpsr_read
   0.00      0.05     0.00  1820659     0.00     0.00  cpu_get_tb_cpu_state
   0.00      0.05     0.00  1441344     0.00     0.00  arm_cpu_tlb_fill
   0.00      0.05     0.00  1427177     0.00     0.00  cxl_usp_to_cstate
   0.00      0.05     0.00  1161059     0.00     5.85  aarch64_sync_64_to_32
   0.00      0.05     0.00   886523     0.00     0.00  helper_iwmmxt_maxsb
   0.00      0.05     0.00   831393     0.00     0.00  arm_log_exception
   0.00      0.05     0.00   746940     0.00     0.00  helper_v7m_preserve_fp_state
   0.00      0.05     0.00   728354     0.00     0.00  hmp_calc_dirty_rate
   0.00      0.05     0.00   681634     0.00     0.00  helper_sadd8
   0.00      0.05     0.00   487743     0.00     7.14  qmp_query_cpu_definitions
   0.00      0.05     0.00   420528     0.00     0.00  arm_v7m_cpu_do_interrupt
   0.00      0.05     0.00   382245     0.00     0.00  helper_ssub8
   0.00      0.05     0.00   374192     0.00     0.00  helper_usub8
   0.00      0.05     0.00   347199     0.00     0.00  usb_msd_load_request
   0.00      0.05     0.00   325862     0.00     0.00  target_disas
   0.00      0.05     0.00   322375     0.00     0.00  arm_hcrx_el2_eff
   0.00      0.05     0.00   317835     0.00     0.00  virtio_bus_device_iommu_enabled
   0.00      0.05     0.00   309559     0.00     0.00  mig_throttle_counter_reset
   0.00      0.05     0.00   301557     0.00     0.00  ram_bytes_remaining
   0.00      0.05     0.00   292888     0.00     0.00  helper_v7m_blxns
   0.00      0.05     0.00   289093     0.00     0.00  tpm_util_show_buffer
   0.00      0.05     0.00   274156     0.00     0.00  helper_sxtb16
   0.00      0.05     0.00   273588     0.00     0.00  write_v7m_exception
   0.00      0.05     0.00   271619     0.00     0.00  page_size_init
   0.00      0.05     0.00   270247     0.00     0.00  qemu_fdt_setprop_sized_cells_from_array
   0.00      0.05     0.00   229643     0.00    14.69  helper_neon_addl_u32


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
  2022-06-24 16:50       ` Cédric Le Goater
@ 2022-06-29  9:11         ` Cédric Le Goater
  2022-06-29 14:14           ` Alex Bennée
  0 siblings, 1 reply; 19+ messages in thread
From: Cédric Le Goater @ 2022-06-29  9:11 UTC (permalink / raw)
  To: Peter Delevoryas
  Cc: Peter Maydell, Andrew Jeffery, Joel Stanley, pbonzini, berrange,
	eduardo, marcel.apfelbaum, richard.henderson,
	Philippe Mathieu-Daudé,
	ani, Cameron Esfahani via, qemu-arm, Alex Bennée

On 6/24/22 18:50, Cédric Le Goater wrote:
> On 6/23/22 20:43, Peter Delevoryas wrote:
>>
>>
>>> On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
>>>
>>> On 6/23/22 12:26, Peter Delevoryas wrote:
>>>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>>>
>>> Let's start simple without flash support. We should be able to
>>> load FW blobs in each CPU address space using loader devices.
>>
>> Actually, I was unable to do this, perhaps because the fb OpenBMC
>> boot sequence is a little weird. I specifically _needed_ to have
>> a flash device which maps the firmware in at 0x2000_0000, because
>> the fb OpenBMC U-Boot SPL jumps to that address to start executing
>> from flash? I think this is also why fb OpenBMC machines can be so slow.
>>
>> $ ./build/qemu-system-arm -machine fby35 \
>>      -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
>>      -d int -drive file=fby35.mtd,format=raw,if=mtd
> 
> 
> 
> Ideally we should be booting from the flash device directly using
> the machine option '-M ast2600-evb,execute-in-place=true' like HW
> does. Instructions are fetched using SPI transfers. But the amount
> of code generated is tremendous. See some profiling below for a
> run which barely reaches DRAM training in U-Boot.

Some more profiling on both ast2500 and ast2600 machines shows :


* ast2600-evb,execute-in-place=true :

Type               Object  Call site                Wait Time (s)         Count  Average (us)
---------------------------------------------------------------------------------------------
BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1365       14.21443      32909927          0.43
condvar    0x564dc0f02988  util/thread-pool.c:90         10.02312            56     178984.32
condvar    [           2]  softmmu/cpus.c:423             0.10051             6      16752.04
BQL mutex  0x564dc03922e0  util/rcu.c:269                 0.04372             4      10930.60
BQL mutex  0x564dc03922e0  cpus-common.c:341              0.00151             8        189.16
condvar    0x564dc0390360  cpus-common.c:176              0.00092             8        115.04
condvar    0x564dc0392280  softmmu/cpus.c:642             0.00013             2         65.04
condvar    0x564dc0392240  softmmu/cpus.c:571             0.00010             2         49.54
BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1426        0.00006           467          0.14
condvar    0x564dc03903a0  cpus-common.c:206              0.00004             8          5.28
---------------------------------------------------------------------------------------------


* ast2500-evb,execute-in-place=true :

Type               Object  Call site                Wait Time (s)         Count  Average (us)
---------------------------------------------------------------------------------------------
condvar    0x55a581137f88  util/thread-pool.c:90         10.01158            28     357556.50
BQL mutex  0x55a57f0e02e0  accel/tcg/cputlb.c:1365        0.29886      14394475          0.02
condvar    0x55a5814cb5a0  softmmu/cpus.c:423             0.02182             2      10912.44
BQL mutex  0x55a57f0e02e0  util/rcu.c:269                 0.01420             4       3549.56
mutex      0x55a5813381c0  tcg/region.c:204               0.00007          3052          0.02
condvar    0x55a57f0e0280  softmmu/cpus.c:642             0.00006             1         59.79
mutex      [           2]  chardev/char.c:118             0.00003          1492          0.02
BQL mutex  0x55a57f0e02e0  util/main-loop.c:318           0.00002            34          0.72
BQL mutex  0x55a57f0e02e0  accel/tcg/cputlb.c:1426        0.00002           973          0.02
condvar    0x55a57f0e0240  softmmu/cpus.c:571             0.00002             1         15.16
---------------------------------------------------------------------------------------------

C.



> 
> * execute-in-place=true
> 
> Each sample counts as 0.01 seconds.
>    %   cumulative   self              self     total
>   time   seconds   seconds    calls  ns/call  ns/call  name
> 100.00      0.02     0.02   164276   121.75   121.75  memory_region_init_rom_device
>    0.00      0.02     0.00 1610346008     0.00     0.00  tcg_code_capacity
>    0.00      0.02     0.00 567612621     0.00     0.00  type_register_static_array
>    0.00      0.02     0.00 328886191     0.00     0.00  do_common_semihosting
>    0.00      0.02     0.00 297215811     0.00     0.00  container_get
>    0.00      0.02     0.00 292670030     0.00     0.00  arm_cpu_tlb_fill
>    0.00      0.02     0.00 195416119     0.00     0.00  arm_cpu_register_gdb_regs_for_features
>    0.00      0.02     0.00 193326677     0.00     0.00  object_type_get_instance_size
>    0.00      0.02     0.00 182365829     0.00     0.00  tcg_op_insert_after
>    0.00      0.02     0.00 150668458     0.00     0.00  plugin_gen_tb_end
>    0.00      0.02     0.00 142171940     0.00     0.00  gen_new_label
>    0.00      0.02     0.00 133200628     0.00     0.00  smbios_build_type_38_table
>    0.00      0.02     0.00 130540338     0.00     0.00  object_dynamic_cast_assert
>    0.00      0.02     0.00 129223195     0.00     0.00  cpu_loop_exit_atomic
>    0.00      0.02     0.00 121759298     0.00     0.00  tcg_remove_ops_after
>    0.00      0.02     0.00 116887887     0.00     0.00  in_code_gen_buffer
>    0.00      0.02     0.00 111803833     0.00     0.00  tcg_emit_op
>    0.00      0.02     0.00 106052221     0.00     0.00  object_class_dynamic_cast_assert
>    0.00      0.02     0.00 99704054     0.00     0.00  __jit_debug_register_code
>    0.00      0.02     0.00 97812458     0.00     0.00  object_get_class
>    0.00      0.02     0.00 88952594     0.00     0.00  tcg_splitwx_to_rx
>    0.00      0.02     0.00 85790920     0.00     0.00  object_class_dynamic_cast
>    0.00      0.02     0.00 73780673     0.00     0.00  helper_exit_atomic
>    0.00      0.02     0.00 65337482     0.00     0.00  tcg_op_supported
>    0.00      0.02     0.00 61213619     0.00     0.00  tcg_func_start
>    0.00      0.02     0.00 54477684     0.00     0.00  tcg_flush_softmmu_tlb
>    0.00      0.02     0.00 53968980     0.00     0.00  tcg_temp_new_internal
>    0.00      0.02     0.00 51526008     0.00     0.00  qemu_in_vcpu_thread
>    0.00      0.02     0.00 40750952     0.00     0.00  pflash_cfi02_register
>    0.00      0.02     0.00 38039442     0.00     0.00  tcg_gen_op2
>    0.00      0.02     0.00 37068039     0.00     0.00  tcg_gen_op1
>    0.00      0.02     0.00 36473276     0.00     0.00  tcg_gen_op3
>    0.00      0.02     0.00 36310225     0.00     0.00  gen_gvec_uaba
>    0.00      0.02     0.00 30985436     0.00     0.00  tb_set_jmp_target
>    0.00      0.02     0.00 30291796     0.00     0.00  tcg_constant_internal
>    0.00      0.02     0.00 29857950     0.00     0.00  ssi_transfer
> 
> * execute-in-place=false
> 
> Each sample counts as 0.01 seconds.
>    %   cumulative   self              self     total
>   time   seconds   seconds    calls  ns/call  ns/call  name
>   40.00      0.02     0.02   551149    36.29    36.29  aspeed_board_init_flashes
>   20.00      0.03     0.01  3937238     2.54     2.54  register_cp_regs_for_features
>   20.00      0.04     0.01   674096    14.83    14.83  gen_gvec_uaba
>   20.00      0.05     0.01   457461    21.86    21.86  finalize_target_page_bits
>    0.00      0.05     0.00  5364258     0.00     0.00  arm_gt_hvtimer_cb
>    0.00      0.05     0.00  2467532     0.00     0.00  helper_neon_narrow_sat_s8
>    0.00      0.05     0.00  2431860     0.00     0.00  opb_opb2fsi_address
>    0.00      0.05     0.00  1828453     0.00     0.00  cpsr_read
>    0.00      0.05     0.00  1820659     0.00     0.00  cpu_get_tb_cpu_state
>    0.00      0.05     0.00  1441344     0.00     0.00  arm_cpu_tlb_fill
>    0.00      0.05     0.00  1427177     0.00     0.00  cxl_usp_to_cstate
>    0.00      0.05     0.00  1161059     0.00     5.85  aarch64_sync_64_to_32
>    0.00      0.05     0.00   886523     0.00     0.00  helper_iwmmxt_maxsb
>    0.00      0.05     0.00   831393     0.00     0.00  arm_log_exception
>    0.00      0.05     0.00   746940     0.00     0.00  helper_v7m_preserve_fp_state
>    0.00      0.05     0.00   728354     0.00     0.00  hmp_calc_dirty_rate
>    0.00      0.05     0.00   681634     0.00     0.00  helper_sadd8
>    0.00      0.05     0.00   487743     0.00     7.14  qmp_query_cpu_definitions
>    0.00      0.05     0.00   420528     0.00     0.00  arm_v7m_cpu_do_interrupt
>    0.00      0.05     0.00   382245     0.00     0.00  helper_ssub8
>    0.00      0.05     0.00   374192     0.00     0.00  helper_usub8
>    0.00      0.05     0.00   347199     0.00     0.00  usb_msd_load_request
>    0.00      0.05     0.00   325862     0.00     0.00  target_disas
>    0.00      0.05     0.00   322375     0.00     0.00  arm_hcrx_el2_eff
>    0.00      0.05     0.00   317835     0.00     0.00  virtio_bus_device_iommu_enabled
>    0.00      0.05     0.00   309559     0.00     0.00  mig_throttle_counter_reset
>    0.00      0.05     0.00   301557     0.00     0.00  ram_bytes_remaining
>    0.00      0.05     0.00   292888     0.00     0.00  helper_v7m_blxns
>    0.00      0.05     0.00   289093     0.00     0.00  tpm_util_show_buffer
>    0.00      0.05     0.00   274156     0.00     0.00  helper_sxtb16
>    0.00      0.05     0.00   273588     0.00     0.00  write_v7m_exception
>    0.00      0.05     0.00   271619     0.00     0.00  page_size_init
>    0.00      0.05     0.00   270247     0.00     0.00  qemu_fdt_setprop_sized_cells_from_array
>    0.00      0.05     0.00   229643     0.00    14.69  helper_neon_addl_u32



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
  2022-06-29  9:11         ` Cédric Le Goater
@ 2022-06-29 14:14           ` Alex Bennée
  2022-06-29 15:54             ` Cédric Le Goater
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Bennée @ 2022-06-29 14:14 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Peter Delevoryas, Peter Maydell, Andrew Jeffery, Joel Stanley,
	pbonzini, berrange, eduardo, marcel.apfelbaum, richard.henderson,
	Philippe Mathieu-Daudé,
	ani, Cameron Esfahani via, qemu-arm


Cédric Le Goater <clg@kaod.org> writes:

> On 6/24/22 18:50, Cédric Le Goater wrote:
>> On 6/23/22 20:43, Peter Delevoryas wrote:
>>>
>>>
>>>> On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
>>>>
>>>> On 6/23/22 12:26, Peter Delevoryas wrote:
>>>>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>>>>
>>>> Let's start simple without flash support. We should be able to
>>>> load FW blobs in each CPU address space using loader devices.
>>>
>>> Actually, I was unable to do this, perhaps because the fb OpenBMC
>>> boot sequence is a little weird. I specifically _needed_ to have
>>> a flash device which maps the firmware in at 0x2000_0000, because
>>> the fb OpenBMC U-Boot SPL jumps to that address to start executing
>>> from flash? I think this is also why fb OpenBMC machines can be so slow.
>>>
>>> $ ./build/qemu-system-arm -machine fby35 \
>>>      -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
>>>      -d int -drive file=fby35.mtd,format=raw,if=mtd
>> Ideally we should be booting from the flash device directly using
>> the machine option '-M ast2600-evb,execute-in-place=true' like HW
>> does. Instructions are fetched using SPI transfers. But the amount
>> of code generated is tremendous.

Yeah because there is a potential race when reading from HW so we throw
away TB's after executing them because we have no way of knowing if it
has changed under our feet. See 873d64ac30 (accel/tcg: re-factor non-RAM
execution code) which cleaned up this handling.

>> See some profiling below for a
>> run which barely reaches DRAM training in U-Boot.
>
> Some more profiling on both ast2500 and ast2600 machines shows :
>
>
> * ast2600-evb,execute-in-place=true :
>
> Type               Object  Call site                Wait Time (s)         Count  Average (us)
> ---------------------------------------------------------------------------------------------
> BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1365       14.21443
> 32909927          0.43

This is unavoidable as a HW access needs the BQL held so we will go
through this cycle every executed instruction.

Did I miss why the flash contents are not mapped into the physical
address space? Isn't that how it appear to the processor?

> condvar    0x564dc0f02988  util/thread-pool.c:90         10.02312            56     178984.32
> condvar    [           2]  softmmu/cpus.c:423             0.10051             6      16752.04
> BQL mutex  0x564dc03922e0  util/rcu.c:269                 0.04372             4      10930.60
> BQL mutex  0x564dc03922e0  cpus-common.c:341              0.00151             8        189.16
> condvar    0x564dc0390360  cpus-common.c:176              0.00092             8        115.04
> condvar    0x564dc0392280  softmmu/cpus.c:642             0.00013             2         65.04
> condvar    0x564dc0392240  softmmu/cpus.c:571             0.00010             2         49.54
> BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1426        0.00006           467          0.14
> condvar    0x564dc03903a0  cpus-common.c:206              0.00004             8          5.28
> ---------------------------------------------------------------------------------------------
>
>
> * ast2500-evb,execute-in-place=true :
>
> Type               Object  Call site                Wait Time (s)         Count  Average (us)
> ---------------------------------------------------------------------------------------------
> condvar    0x55a581137f88  util/thread-pool.c:90         10.01158            28     357556.50
> BQL mutex  0x55a57f0e02e0  accel/tcg/cputlb.c:1365        0.29886      14394475          0.02
> condvar    0x55a5814cb5a0  softmmu/cpus.c:423             0.02182             2      10912.44
> BQL mutex  0x55a57f0e02e0  util/rcu.c:269                 0.01420             4       3549.56
> mutex      0x55a5813381c0  tcg/region.c:204               0.00007          3052          0.02
> condvar    0x55a57f0e0280  softmmu/cpus.c:642             0.00006             1         59.79
> mutex      [           2]  chardev/char.c:118             0.00003          1492          0.02
> BQL mutex  0x55a57f0e02e0  util/main-loop.c:318           0.00002            34          0.72
> BQL mutex  0x55a57f0e02e0  accel/tcg/cputlb.c:1426        0.00002           973          0.02
> condvar    0x55a57f0e0240  softmmu/cpus.c:571             0.00002             1         15.16
> ---------------------------------------------------------------------------------------------
>
> C.
>
>
>
>> * execute-in-place=true
>> Each sample counts as 0.01 seconds.
>>    %   cumulative   self              self     total
>>   time   seconds   seconds    calls  ns/call  ns/call  name
>> 100.00      0.02     0.02   164276   121.75   121.75  memory_region_init_rom_device
>>    0.00      0.02     0.00 1610346008     0.00     0.00  tcg_code_capacity
>>    0.00      0.02     0.00 567612621     0.00     0.00  type_register_static_array
>>    0.00      0.02     0.00 328886191     0.00     0.00  do_common_semihosting
>>    0.00      0.02     0.00 297215811     0.00     0.00  container_get
>>    0.00      0.02     0.00 292670030     0.00     0.00  arm_cpu_tlb_fill
>>    0.00      0.02     0.00 195416119     0.00     0.00  arm_cpu_register_gdb_regs_for_features
>>    0.00      0.02     0.00 193326677     0.00     0.00  object_type_get_instance_size
>>    0.00      0.02     0.00 182365829     0.00     0.00  tcg_op_insert_after
>>    0.00      0.02     0.00 150668458     0.00     0.00  plugin_gen_tb_end
>>    0.00      0.02     0.00 142171940     0.00     0.00  gen_new_label
>>    0.00      0.02     0.00 133200628     0.00     0.00  smbios_build_type_38_table
>>    0.00      0.02     0.00 130540338     0.00     0.00  object_dynamic_cast_assert
>>    0.00      0.02     0.00 129223195     0.00     0.00  cpu_loop_exit_atomic
>>    0.00      0.02     0.00 121759298     0.00     0.00  tcg_remove_ops_after
>>    0.00      0.02     0.00 116887887     0.00     0.00  in_code_gen_buffer
>>    0.00      0.02     0.00 111803833     0.00     0.00  tcg_emit_op
>>    0.00      0.02     0.00 106052221     0.00     0.00  object_class_dynamic_cast_assert
>>    0.00      0.02     0.00 99704054     0.00     0.00  __jit_debug_register_code
>>    0.00      0.02     0.00 97812458     0.00     0.00  object_get_class
>>    0.00      0.02     0.00 88952594     0.00     0.00  tcg_splitwx_to_rx
>>    0.00      0.02     0.00 85790920     0.00     0.00  object_class_dynamic_cast
>>    0.00      0.02     0.00 73780673     0.00     0.00  helper_exit_atomic
>>    0.00      0.02     0.00 65337482     0.00     0.00  tcg_op_supported
>>    0.00      0.02     0.00 61213619     0.00     0.00  tcg_func_start
>>    0.00      0.02     0.00 54477684     0.00     0.00  tcg_flush_softmmu_tlb
>>    0.00      0.02     0.00 53968980     0.00     0.00  tcg_temp_new_internal
>>    0.00      0.02     0.00 51526008     0.00     0.00  qemu_in_vcpu_thread
>>    0.00      0.02     0.00 40750952     0.00     0.00  pflash_cfi02_register
>>    0.00      0.02     0.00 38039442     0.00     0.00  tcg_gen_op2
>>    0.00      0.02     0.00 37068039     0.00     0.00  tcg_gen_op1
>>    0.00      0.02     0.00 36473276     0.00     0.00  tcg_gen_op3
>>    0.00      0.02     0.00 36310225     0.00     0.00  gen_gvec_uaba
>>    0.00      0.02     0.00 30985436     0.00     0.00  tb_set_jmp_target
>>    0.00      0.02     0.00 30291796     0.00     0.00  tcg_constant_internal
>>    0.00      0.02     0.00 29857950     0.00     0.00  ssi_transfer
>> * execute-in-place=false
>> Each sample counts as 0.01 seconds.
>>    %   cumulative   self              self     total
>>   time   seconds   seconds    calls  ns/call  ns/call  name
>>   40.00      0.02     0.02   551149    36.29    36.29  aspeed_board_init_flashes
>>   20.00      0.03     0.01  3937238     2.54     2.54  register_cp_regs_for_features
>>   20.00      0.04     0.01   674096    14.83    14.83  gen_gvec_uaba
>>   20.00      0.05     0.01   457461    21.86    21.86  finalize_target_page_bits
>>    0.00      0.05     0.00  5364258     0.00     0.00  arm_gt_hvtimer_cb
>>    0.00      0.05     0.00  2467532     0.00     0.00  helper_neon_narrow_sat_s8
>>    0.00      0.05     0.00  2431860     0.00     0.00  opb_opb2fsi_address
>>    0.00      0.05     0.00  1828453     0.00     0.00  cpsr_read
>>    0.00      0.05     0.00  1820659     0.00     0.00  cpu_get_tb_cpu_state
>>    0.00      0.05     0.00  1441344     0.00     0.00  arm_cpu_tlb_fill
>>    0.00      0.05     0.00  1427177     0.00     0.00  cxl_usp_to_cstate
>>    0.00      0.05     0.00  1161059     0.00     5.85  aarch64_sync_64_to_32
>>    0.00      0.05     0.00   886523     0.00     0.00  helper_iwmmxt_maxsb
>>    0.00      0.05     0.00   831393     0.00     0.00  arm_log_exception
>>    0.00      0.05     0.00   746940     0.00     0.00  helper_v7m_preserve_fp_state
>>    0.00      0.05     0.00   728354     0.00     0.00  hmp_calc_dirty_rate
>>    0.00      0.05     0.00   681634     0.00     0.00  helper_sadd8
>>    0.00      0.05     0.00   487743     0.00     7.14  qmp_query_cpu_definitions
>>    0.00      0.05     0.00   420528     0.00     0.00  arm_v7m_cpu_do_interrupt
>>    0.00      0.05     0.00   382245     0.00     0.00  helper_ssub8
>>    0.00      0.05     0.00   374192     0.00     0.00  helper_usub8
>>    0.00      0.05     0.00   347199     0.00     0.00  usb_msd_load_request
>>    0.00      0.05     0.00   325862     0.00     0.00  target_disas
>>    0.00      0.05     0.00   322375     0.00     0.00  arm_hcrx_el2_eff
>>    0.00      0.05     0.00   317835     0.00     0.00  virtio_bus_device_iommu_enabled
>>    0.00      0.05     0.00   309559     0.00     0.00  mig_throttle_counter_reset
>>    0.00      0.05     0.00   301557     0.00     0.00  ram_bytes_remaining
>>    0.00      0.05     0.00   292888     0.00     0.00  helper_v7m_blxns
>>    0.00      0.05     0.00   289093     0.00     0.00  tpm_util_show_buffer
>>    0.00      0.05     0.00   274156     0.00     0.00  helper_sxtb16
>>    0.00      0.05     0.00   273588     0.00     0.00  write_v7m_exception
>>    0.00      0.05     0.00   271619     0.00     0.00  page_size_init
>>    0.00      0.05     0.00   270247     0.00     0.00  qemu_fdt_setprop_sized_cells_from_array
>>    0.00      0.05     0.00   229643     0.00    14.69  helper_neon_addl_u32


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
  2022-06-29 14:14           ` Alex Bennée
@ 2022-06-29 15:54             ` Cédric Le Goater
  2022-06-29 18:24               ` Alex Bennée
  0 siblings, 1 reply; 19+ messages in thread
From: Cédric Le Goater @ 2022-06-29 15:54 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Peter Delevoryas, Peter Maydell, Andrew Jeffery, Joel Stanley,
	pbonzini, berrange, eduardo, marcel.apfelbaum, richard.henderson,
	Philippe Mathieu-Daudé,
	ani, Cameron Esfahani via, qemu-arm

On 6/29/22 16:14, Alex Bennée wrote:
> 
> Cédric Le Goater <clg@kaod.org> writes:
> 
>> On 6/24/22 18:50, Cédric Le Goater wrote:
>>> On 6/23/22 20:43, Peter Delevoryas wrote:
>>>>
>>>>
>>>>> On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
>>>>>
>>>>> On 6/23/22 12:26, Peter Delevoryas wrote:
>>>>>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>>>>>
>>>>> Let's start simple without flash support. We should be able to
>>>>> load FW blobs in each CPU address space using loader devices.
>>>>
>>>> Actually, I was unable to do this, perhaps because the fb OpenBMC
>>>> boot sequence is a little weird. I specifically _needed_ to have
>>>> a flash device which maps the firmware in at 0x2000_0000, because
>>>> the fb OpenBMC U-Boot SPL jumps to that address to start executing
>>>> from flash? I think this is also why fb OpenBMC machines can be so slow.
>>>>
>>>> $ ./build/qemu-system-arm -machine fby35 \
>>>>       -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
>>>>       -d int -drive file=fby35.mtd,format=raw,if=mtd
>>> Ideally we should be booting from the flash device directly using
>>> the machine option '-M ast2600-evb,execute-in-place=true' like HW
>>> does. Instructions are fetched using SPI transfers. But the amount
>>> of code generated is tremendous.
> 
> Yeah because there is a potential race when reading from HW so we throw
> away TB's after executing them because we have no way of knowing if it
> has changed under our feet. See 873d64ac30 (accel/tcg: re-factor non-RAM
> execution code) which cleaned up this handling.
> 
>>> See some profiling below for a
>>> run which barely reaches DRAM training in U-Boot.
>>
>> Some more profiling on both ast2500 and ast2600 machines shows :
>>
>>
>> * ast2600-evb,execute-in-place=true :
>>
>> Type               Object  Call site                Wait Time (s)         Count  Average (us)
>> ---------------------------------------------------------------------------------------------
>> BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1365       14.21443
>> 32909927          0.43
> 
> This is unavoidable as a HW access needs the BQL held so we will go
> through this cycle every executed instruction.
> 
> Did I miss why the flash contents are not mapped into the physical
> address space? Isn't that how it appear to the processor?


There are two modes :
  
         if (ASPEED_MACHINE(machine)->mmio_exec) {
             memory_region_init_alias(boot_rom, NULL, "aspeed.boot_rom",
                                      &fl->mmio, 0, size);
             memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
                                         boot_rom);
         } else {
             memory_region_init_rom(boot_rom, NULL, "aspeed.boot_rom",
                                    size, &error_abort);
             memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
                                         boot_rom);
             write_boot_rom(drive0, FIRMWARE_ADDR, size, &error_abort);
         }

The default boot mode uses the ROM. No issue.

The "execute-in-place=true" option creates an alias on the region of
the flash contents and each instruction is then fetched from the flash
drive with SPI transactions.

With old FW images, using an older U-boot, the machine boots in a couple
of seconds. See the profiling below for a witherspoon-bmc machine using
U-Boot 2016.07.

   qemu-system-arm -M witherspoon-bmc,execute-in-place=true  -drive file=./flash-witherspoon-bmc,format=raw,if=mtd -drive file=./flash-witherspoon-bmc2,format=raw,if=mtd -nographic -nodefaults -snapshot -serial mon:stdio -enable-sync-profile
   ...
   U-Boot 2016.07-00040-g8425e96e2e27-dirty (Jun 24 2022 - 23:21:57 +0200)
   
          Watchdog enabled
   DRAM:  496 MiB
   Flash: 32 MiB
   In:    serial
   Out:   serial
   Err:   serial
   Net:
   (qemu) info sync-profile
   Type               Object  Call site                Wait Time (s)         Count  Average (us)
   ---------------------------------------------------------------------------------------------
   BQL mutex  0x56189610b2e0  accel/tcg/cputlb.c:1365        0.25311      12346237          0.02
   condvar    0x5618970cf220  softmmu/cpus.c:423             0.05506             2      27530.78
   BQL mutex  0x56189610b2e0  util/rcu.c:269                 0.04709             2      23544.26
   condvar    0x561896d0fc78  util/thread-pool.c:90          0.01340            83        161.47
   condvar    0x56189610b240  softmmu/cpus.c:571             0.00005             1         54.93
   condvar    0x56189610b280  softmmu/cpus.c:642             0.00003             1         32.88
   BQL mutex  0x56189610b2e0  util/main-loop.c:318           0.00003            34          0.76
   mutex      0x561896eade00  tcg/region.c:204               0.00002           995          0.02
   rec_mutex  [           2]  util/async.c:682               0.00002           493          0.03
   mutex      [           2]  chardev/char.c:118             0.00001           404          0.03
   ---------------------------------------------------------------------------------------------


However, with recent U-boots, it takes quite a while to reach DRAM training.
Close to a minute. See the profiling below for an ast2500-evb machine using
U-Boot 2019.04.

  qemu-system-arm -M ast2500-evb,execute-in-place=true  -net nic,macaddr=C0:FF:EE:00:00:03,netdev=net0  -drive file=./flash-ast2500-evb,format=raw,if=mtd  -nographic -nodefaults -snapshot -serial mon:stdio  -enable-sync-profile
   qemu-system-arm: warning: Aspeed iBT has no chardev backend
   qemu-system-arm: warning: nic ftgmac100.1 has no peer
   QEMU 7.0.50 monitor - type 'help' for more information
   
   U-Boot 2019.04-00080-g6ca27db3f97b-dirty (Jun 24 2022 - 23:22:03 +0200)
   
   SOC : AST2500-A1
   RST : Power On
   LPC Mode : SIO:Enable : SuperIO-2e
   Eth : MAC0: RGMII, , MAC1: RGMII,
   Model: AST2500 EVB
   DRAM:  448 MiB (capacity:512 MiB, VGA:64 MiB, ECC:off)
   MMC:   sdhci_slot0@100: 0, sdhci_slot1@200: 1
   Loading Environment from SPI Flash... SF: Detected mx25l25635e with page size 256 Bytes, erase size 64 KiB, total 32 MiB
   *** Warning - bad CRC, using default environment
   
   In:    serial@1e784000
   Out:   serial@1e784000
   Err:   serial@1e784000
   Net:   eth0: ethernet@1e660000
   Warning: ethernet@1e680000 (eth1) using random MAC address - 4a:e5:9a:4a:c7:c5
   , eth1: ethernet@1e680000
   Hit any key to stop autoboot:  2
   (qemu) info sync-profile
   Type               Object  Call site                Wait Time (s)         Count  Average (us)
   ---------------------------------------------------------------------------------------------
   condvar    0x561f10c9ef88  util/thread-pool.c:90         10.01196            28     357570.00
   BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1365        0.29496      14248621          0.02
   condvar    0x561f110325a0  softmmu/cpus.c:423             0.02231             2      11152.57
   BQL mutex  0x561f102362e0  util/rcu.c:269                 0.01447             4       3618.60
   condvar    0x561f10236240  softmmu/cpus.c:571             0.00010             1        102.19
   mutex      0x561f10e9f1c0  tcg/region.c:204               0.00007          3052          0.02
   mutex      [           2]  chardev/char.c:118             0.00003          1486          0.02
   condvar    0x561f10236280  softmmu/cpus.c:642             0.00003             1         29.38
   BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1426        0.00002           973          0.02
   BQL mutex  0x561f102362e0  util/main-loop.c:318           0.00001            34          0.41
   ---------------------------------------------------------------------------------------------
   

Something in the layout of the FW is making a big difference. One
that could be relevant is that the recent versions are using a device
tree.

There might be no good solution to this issue but I fail to analyze
it correctly. Is there a way to collect information on the usage of
Translation Blocks ?

Thanks,

C.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
  2022-06-29 15:54             ` Cédric Le Goater
@ 2022-06-29 18:24               ` Alex Bennée
  2022-06-30  8:49                 ` Cédric Le Goater
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Bennée @ 2022-06-29 18:24 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Peter Delevoryas, Peter Maydell, Andrew Jeffery, Joel Stanley,
	pbonzini, berrange, eduardo, marcel.apfelbaum, richard.henderson,
	Philippe Mathieu-Daudé,
	ani, Cameron Esfahani via, qemu-arm


Cédric Le Goater <clg@kaod.org> writes:

> On 6/29/22 16:14, Alex Bennée wrote:
>> Cédric Le Goater <clg@kaod.org> writes:
>> 
>>> On 6/24/22 18:50, Cédric Le Goater wrote:
>>>> On 6/23/22 20:43, Peter Delevoryas wrote:
>>>>>
>>>>>
>>>>>> On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
>>>>>>
>>>>>> On 6/23/22 12:26, Peter Delevoryas wrote:
>>>>>>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>>>>>>
>>>>>> Let's start simple without flash support. We should be able to
>>>>>> load FW blobs in each CPU address space using loader devices.
>>>>>
>>>>> Actually, I was unable to do this, perhaps because the fb OpenBMC
>>>>> boot sequence is a little weird. I specifically _needed_ to have
>>>>> a flash device which maps the firmware in at 0x2000_0000, because
>>>>> the fb OpenBMC U-Boot SPL jumps to that address to start executing
>>>>> from flash? I think this is also why fb OpenBMC machines can be so slow.
>>>>>
>>>>> $ ./build/qemu-system-arm -machine fby35 \
>>>>>       -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
>>>>>       -d int -drive file=fby35.mtd,format=raw,if=mtd
>>>> Ideally we should be booting from the flash device directly using
>>>> the machine option '-M ast2600-evb,execute-in-place=true' like HW
>>>> does. Instructions are fetched using SPI transfers. But the amount
>>>> of code generated is tremendous.
>> Yeah because there is a potential race when reading from HW so we
>> throw
>> away TB's after executing them because we have no way of knowing if it
>> has changed under our feet. See 873d64ac30 (accel/tcg: re-factor non-RAM
>> execution code) which cleaned up this handling.
>> 
>>>> See some profiling below for a
>>>> run which barely reaches DRAM training in U-Boot.
>>>
>>> Some more profiling on both ast2500 and ast2600 machines shows :
>>>
>>>
>>> * ast2600-evb,execute-in-place=true :
>>>
>>> Type               Object  Call site                Wait Time (s)         Count  Average (us)
>>> ---------------------------------------------------------------------------------------------
>>> BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1365       14.21443
>>> 32909927          0.43
>> This is unavoidable as a HW access needs the BQL held so we will go
>> through this cycle every executed instruction.
>> Did I miss why the flash contents are not mapped into the physical
>> address space? Isn't that how it appear to the processor?
>
>
> There are two modes :
>           if (ASPEED_MACHINE(machine)->mmio_exec) {
>             memory_region_init_alias(boot_rom, NULL, "aspeed.boot_rom",
>                                      &fl->mmio, 0, size);
>             memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>                                         boot_rom);
>         } else {
>             memory_region_init_rom(boot_rom, NULL, "aspeed.boot_rom",
>                                    size, &error_abort);
>             memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>                                         boot_rom);
>             write_boot_rom(drive0, FIRMWARE_ADDR, size, &error_abort);
>         }
>
> The default boot mode uses the ROM. No issue.
>
> The "execute-in-place=true" option creates an alias on the region of
> the flash contents and each instruction is then fetched from the flash
> drive with SPI transactions.
>
> With old FW images, using an older U-boot, the machine boots in a couple
> of seconds. See the profiling below for a witherspoon-bmc machine using
> U-Boot 2016.07.
>
>   qemu-system-arm -M witherspoon-bmc,execute-in-place=true  -drive file=./flash-witherspoon-bmc,format=raw,if=mtd -drive file=./flash-witherspoon-bmc2,format=raw,if=mtd -nographic -nodefaults -snapshot -serial mon:stdio -enable-sync-profile
>   ...
>   U-Boot 2016.07-00040-g8425e96e2e27-dirty (Jun 24 2022 - 23:21:57 +0200)
>             Watchdog enabled
>   DRAM:  496 MiB
>   Flash: 32 MiB
>   In:    serial
>   Out:   serial
>   Err:   serial
>   Net:
>   (qemu) info sync-profile
>   Type               Object  Call site                Wait Time (s)         Count  Average (us)
>   ---------------------------------------------------------------------------------------------
>   BQL mutex  0x56189610b2e0  accel/tcg/cputlb.c:1365        0.25311      12346237          0.02
>   condvar    0x5618970cf220  softmmu/cpus.c:423             0.05506             2      27530.78
>   BQL mutex  0x56189610b2e0  util/rcu.c:269                 0.04709             2      23544.26
>   condvar    0x561896d0fc78  util/thread-pool.c:90          0.01340            83        161.47
>   condvar    0x56189610b240  softmmu/cpus.c:571             0.00005             1         54.93
>   condvar    0x56189610b280  softmmu/cpus.c:642             0.00003             1         32.88
>   BQL mutex  0x56189610b2e0  util/main-loop.c:318           0.00003            34          0.76
>   mutex      0x561896eade00  tcg/region.c:204               0.00002           995          0.02
>   rec_mutex  [           2]  util/async.c:682               0.00002           493          0.03
>   mutex      [           2]  chardev/char.c:118             0.00001           404          0.03
>   ---------------------------------------------------------------------------------------------
>
>
> However, with recent U-boots, it takes quite a while to reach DRAM training.
> Close to a minute. See the profiling below for an ast2500-evb machine using
> U-Boot 2019.04.
>
>  qemu-system-arm -M ast2500-evb,execute-in-place=true  -net nic,macaddr=C0:FF:EE:00:00:03,netdev=net0  -drive file=./flash-ast2500-evb,format=raw,if=mtd  -nographic -nodefaults -snapshot -serial mon:stdio  -enable-sync-profile
>   qemu-system-arm: warning: Aspeed iBT has no chardev backend
>   qemu-system-arm: warning: nic ftgmac100.1 has no peer
>   QEMU 7.0.50 monitor - type 'help' for more information
>      U-Boot 2019.04-00080-g6ca27db3f97b-dirty (Jun 24 2022 - 23:22:03
>     +0200)
>      SOC : AST2500-A1
>   RST : Power On
>   LPC Mode : SIO:Enable : SuperIO-2e
>   Eth : MAC0: RGMII, , MAC1: RGMII,
>   Model: AST2500 EVB
>   DRAM:  448 MiB (capacity:512 MiB, VGA:64 MiB, ECC:off)
>   MMC:   sdhci_slot0@100: 0, sdhci_slot1@200: 1
>   Loading Environment from SPI Flash... SF: Detected mx25l25635e with page size 256 Bytes, erase size 64 KiB, total 32 MiB
>   *** Warning - bad CRC, using default environment
>      In:    serial@1e784000
>   Out:   serial@1e784000
>   Err:   serial@1e784000
>   Net:   eth0: ethernet@1e660000
>   Warning: ethernet@1e680000 (eth1) using random MAC address - 4a:e5:9a:4a:c7:c5
>   , eth1: ethernet@1e680000
>   Hit any key to stop autoboot:  2
>   (qemu) info sync-profile
>   Type               Object  Call site                Wait Time (s)         Count  Average (us)
>   ---------------------------------------------------------------------------------------------
>   condvar    0x561f10c9ef88  util/thread-pool.c:90         10.01196            28     357570.00
>   BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1365        0.29496      14248621          0.02
>   condvar    0x561f110325a0  softmmu/cpus.c:423             0.02231             2      11152.57
>   BQL mutex  0x561f102362e0  util/rcu.c:269                 0.01447             4       3618.60
>   condvar    0x561f10236240  softmmu/cpus.c:571             0.00010             1        102.19
>   mutex      0x561f10e9f1c0  tcg/region.c:204               0.00007          3052          0.02
>   mutex      [           2]  chardev/char.c:118             0.00003          1486          0.02
>   condvar    0x561f10236280  softmmu/cpus.c:642             0.00003             1         29.38
>   BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1426        0.00002           973          0.02
>   BQL mutex  0x561f102362e0  util/main-loop.c:318           0.00001            34          0.41
>   ---------------------------------------------------------------------------------------------
>   Something in the layout of the FW is making a big difference. One
> that could be relevant is that the recent versions are using a device
> tree.
>
> There might be no good solution to this issue but I fail to analyze
> it correctly. Is there a way to collect information on the usage of
> Translation Blocks ?

You could expand the data we collect in tb_tree_stats and expose it via
info jit.

>
> Thanks,
>
> C.


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
  2022-06-29 18:24               ` Alex Bennée
@ 2022-06-30  8:49                 ` Cédric Le Goater
  2022-06-30  9:43                   ` Alex Bennée
  0 siblings, 1 reply; 19+ messages in thread
From: Cédric Le Goater @ 2022-06-30  8:49 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Peter Delevoryas, Peter Maydell, Andrew Jeffery, Joel Stanley,
	pbonzini, berrange, eduardo, marcel.apfelbaum, richard.henderson,
	Philippe Mathieu-Daudé,
	ani, Cameron Esfahani via, qemu-arm

On 6/29/22 20:24, Alex Bennée wrote:
> 
> Cédric Le Goater <clg@kaod.org> writes:
> 
>> On 6/29/22 16:14, Alex Bennée wrote:
>>> Cédric Le Goater <clg@kaod.org> writes:
>>>
>>>> On 6/24/22 18:50, Cédric Le Goater wrote:
>>>>> On 6/23/22 20:43, Peter Delevoryas wrote:
>>>>>>
>>>>>>
>>>>>>> On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
>>>>>>>
>>>>>>> On 6/23/22 12:26, Peter Delevoryas wrote:
>>>>>>>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>>>>>>>
>>>>>>> Let's start simple without flash support. We should be able to
>>>>>>> load FW blobs in each CPU address space using loader devices.
>>>>>>
>>>>>> Actually, I was unable to do this, perhaps because the fb OpenBMC
>>>>>> boot sequence is a little weird. I specifically _needed_ to have
>>>>>> a flash device which maps the firmware in at 0x2000_0000, because
>>>>>> the fb OpenBMC U-Boot SPL jumps to that address to start executing
>>>>>> from flash? I think this is also why fb OpenBMC machines can be so slow.
>>>>>>
>>>>>> $ ./build/qemu-system-arm -machine fby35 \
>>>>>>        -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
>>>>>>        -d int -drive file=fby35.mtd,format=raw,if=mtd
>>>>> Ideally we should be booting from the flash device directly using
>>>>> the machine option '-M ast2600-evb,execute-in-place=true' like HW
>>>>> does. Instructions are fetched using SPI transfers. But the amount
>>>>> of code generated is tremendous.
>>> Yeah because there is a potential race when reading from HW so we
>>> throw
>>> away TB's after executing them because we have no way of knowing if it
>>> has changed under our feet. See 873d64ac30 (accel/tcg: re-factor non-RAM
>>> execution code) which cleaned up this handling.
>>>
>>>>> See some profiling below for a
>>>>> run which barely reaches DRAM training in U-Boot.
>>>>
>>>> Some more profiling on both ast2500 and ast2600 machines shows :
>>>>
>>>>
>>>> * ast2600-evb,execute-in-place=true :
>>>>
>>>> Type               Object  Call site                Wait Time (s)         Count  Average (us)
>>>> ---------------------------------------------------------------------------------------------
>>>> BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1365       14.21443
>>>> 32909927          0.43
>>> This is unavoidable as a HW access needs the BQL held so we will go
>>> through this cycle every executed instruction.
>>> Did I miss why the flash contents are not mapped into the physical
>>> address space? Isn't that how it appear to the processor?
>>
>>
>> There are two modes :
>>            if (ASPEED_MACHINE(machine)->mmio_exec) {
>>              memory_region_init_alias(boot_rom, NULL, "aspeed.boot_rom",
>>                                       &fl->mmio, 0, size);
>>              memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>>                                          boot_rom);
>>          } else {
>>              memory_region_init_rom(boot_rom, NULL, "aspeed.boot_rom",
>>                                     size, &error_abort);
>>              memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>>                                          boot_rom);
>>              write_boot_rom(drive0, FIRMWARE_ADDR, size, &error_abort);
>>          }
>>
>> The default boot mode uses the ROM. No issue.
>>
>> The "execute-in-place=true" option creates an alias on the region of
>> the flash contents and each instruction is then fetched from the flash
>> drive with SPI transactions.
>>
>> With old FW images, using an older U-boot, the machine boots in a couple
>> of seconds. See the profiling below for a witherspoon-bmc machine using
>> U-Boot 2016.07.
>>
>>    qemu-system-arm -M witherspoon-bmc,execute-in-place=true  -drive file=./flash-witherspoon-bmc,format=raw,if=mtd -drive file=./flash-witherspoon-bmc2,format=raw,if=mtd -nographic -nodefaults -snapshot -serial mon:stdio -enable-sync-profile
>>    ...
>>    U-Boot 2016.07-00040-g8425e96e2e27-dirty (Jun 24 2022 - 23:21:57 +0200)
>>              Watchdog enabled
>>    DRAM:  496 MiB
>>    Flash: 32 MiB
>>    In:    serial
>>    Out:   serial
>>    Err:   serial
>>    Net:
>>    (qemu) info sync-profile
>>    Type               Object  Call site                Wait Time (s)         Count  Average (us)
>>    ---------------------------------------------------------------------------------------------
>>    BQL mutex  0x56189610b2e0  accel/tcg/cputlb.c:1365        0.25311      12346237          0.02
>>    condvar    0x5618970cf220  softmmu/cpus.c:423             0.05506             2      27530.78
>>    BQL mutex  0x56189610b2e0  util/rcu.c:269                 0.04709             2      23544.26
>>    condvar    0x561896d0fc78  util/thread-pool.c:90          0.01340            83        161.47
>>    condvar    0x56189610b240  softmmu/cpus.c:571             0.00005             1         54.93
>>    condvar    0x56189610b280  softmmu/cpus.c:642             0.00003             1         32.88
>>    BQL mutex  0x56189610b2e0  util/main-loop.c:318           0.00003            34          0.76
>>    mutex      0x561896eade00  tcg/region.c:204               0.00002           995          0.02
>>    rec_mutex  [           2]  util/async.c:682               0.00002           493          0.03
>>    mutex      [           2]  chardev/char.c:118             0.00001           404          0.03
>>    ---------------------------------------------------------------------------------------------
>>
>>
>> However, with recent U-boots, it takes quite a while to reach DRAM training.
>> Close to a minute. See the profiling below for an ast2500-evb machine using
>> U-Boot 2019.04.
>>
>>   qemu-system-arm -M ast2500-evb,execute-in-place=true  -net nic,macaddr=C0:FF:EE:00:00:03,netdev=net0  -drive file=./flash-ast2500-evb,format=raw,if=mtd  -nographic -nodefaults -snapshot -serial mon:stdio  -enable-sync-profile
>>    qemu-system-arm: warning: Aspeed iBT has no chardev backend
>>    qemu-system-arm: warning: nic ftgmac100.1 has no peer
>>    QEMU 7.0.50 monitor - type 'help' for more information
>>       U-Boot 2019.04-00080-g6ca27db3f97b-dirty (Jun 24 2022 - 23:22:03
>>      +0200)
>>       SOC : AST2500-A1
>>    RST : Power On
>>    LPC Mode : SIO:Enable : SuperIO-2e
>>    Eth : MAC0: RGMII, , MAC1: RGMII,
>>    Model: AST2500 EVB
>>    DRAM:  448 MiB (capacity:512 MiB, VGA:64 MiB, ECC:off)
>>    MMC:   sdhci_slot0@100: 0, sdhci_slot1@200: 1
>>    Loading Environment from SPI Flash... SF: Detected mx25l25635e with page size 256 Bytes, erase size 64 KiB, total 32 MiB
>>    *** Warning - bad CRC, using default environment
>>       In:    serial@1e784000
>>    Out:   serial@1e784000
>>    Err:   serial@1e784000
>>    Net:   eth0: ethernet@1e660000
>>    Warning: ethernet@1e680000 (eth1) using random MAC address - 4a:e5:9a:4a:c7:c5
>>    , eth1: ethernet@1e680000
>>    Hit any key to stop autoboot:  2
>>    (qemu) info sync-profile
>>    Type               Object  Call site                Wait Time (s)         Count  Average (us)
>>    ---------------------------------------------------------------------------------------------
>>    condvar    0x561f10c9ef88  util/thread-pool.c:90         10.01196            28     357570.00
>>    BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1365        0.29496      14248621          0.02
>>    condvar    0x561f110325a0  softmmu/cpus.c:423             0.02231             2      11152.57
>>    BQL mutex  0x561f102362e0  util/rcu.c:269                 0.01447             4       3618.60
>>    condvar    0x561f10236240  softmmu/cpus.c:571             0.00010             1        102.19
>>    mutex      0x561f10e9f1c0  tcg/region.c:204               0.00007          3052          0.02
>>    mutex      [           2]  chardev/char.c:118             0.00003          1486          0.02
>>    condvar    0x561f10236280  softmmu/cpus.c:642             0.00003             1         29.38
>>    BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1426        0.00002           973          0.02
>>    BQL mutex  0x561f102362e0  util/main-loop.c:318           0.00001            34          0.41
>>    ---------------------------------------------------------------------------------------------
>>    Something in the layout of the FW is making a big difference. One
>> that could be relevant is that the recent versions are using a device
>> tree.
>>
>> There might be no good solution to this issue but I fail to analyze
>> it correctly. Is there a way to collect information on the usage of
>> Translation Blocks ?
> 
> You could expand the data we collect in tb_tree_stats and expose it via
> info jit.

The "fast" run, U-Boot 2016.07, gives :


   Translation buffer state:
   gen code size       254880371/1073736704
   TB count            1089
   TB avg target size  16 max=356 bytes
   TB avg host size    278 bytes (expansion ratio: 17.2)
   cross page TB count 0 (0%)
   direct jump count   501 (46%) (2 jumps=372 34%)
   TB hash buckets     1025/8192 (12.51% head buckets used)
   TB hash occupancy   3.32% avg chain occ. Histogram: [0.0,7.5)%|█  ▁  ▁  ▁|[67.5,75.0]%
   TB hash avg chain   1.000 buckets. Histogram: 1|█|1
   
   Statistics:
   TB flush count      0
   TB invalidate count 0
   TLB full flushes    0
   TLB partial flushes 2
   TLB elided flushes  2338
   JIT cycles          2221788409 (0.926 s at 2.4 GHz)
   translated TBs      738520 (aborted=0 0.0%)
   avg ops/TB          15.7 max=459
   deleted ops/TB      2.72
   avg temps/TB        32.89 max=88
   avg host code/TB    113.7
   avg search data/TB  5.2
   cycles/op           192.0
   cycles/in byte      748.7
   cycles/out byte     26.4
   cycles/search byte     582.8
     gen_interm time   57.6%
     gen_code time     42.4%
   optim./code time    19.4%
   liveness/code time  26.1%
   cpu_restore count   0
     avg cycles        0.0
   

and the "slow", U-Boot 2019.04 :

Translation buffer state:
gen code size       368603795/1073736704
TB count            3052
TB avg target size  16 max=360 bytes
TB avg host size    293 bytes (expansion ratio: 17.6)
cross page TB count 0 (0%)
direct jump count   1431 (46%) (2 jumps=1104 36%)
TB hash buckets     2559/8192 (31.24% head buckets used)
TB hash occupancy   9.31% avg chain occ. Histogram: [0,10)%|█ ▃  ▁ ▁ ▁|[90,100]%
TB hash avg chain   1.000 buckets. Histogram: 1|█|1

Statistics:
TB flush count      3
TB invalidate count 0
TLB full flushes    0
TLB partial flushes 3
TLB elided flushes  2367
JIT cycles          26479044772 (11.033 s at 2.4 GHz)
translated TBs      10552169 (aborted=0 0.0%)
avg ops/TB          15.0 max=464
deleted ops/TB      2.44
avg temps/TB        32.43 max=89
avg host code/TB    99.0
avg search data/TB  5.0
cycles/op           167.7
cycles/in byte      626.8
cycles/out byte     25.4
cycles/search byte     499.4
   gen_interm time   50.4%
   gen_code time     49.6%
optim./code time    19.5%
liveness/code time  27.7%
cpu_restore count   0
   avg cycles        0.0


A lot more TBs.
   
C.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
  2022-06-30  8:49                 ` Cédric Le Goater
@ 2022-06-30  9:43                   ` Alex Bennée
  2022-07-05 12:35                     ` Cédric Le Goater
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Bennée @ 2022-06-30  9:43 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Peter Delevoryas, Peter Maydell, Andrew Jeffery, Joel Stanley,
	pbonzini, berrange, eduardo, marcel.apfelbaum, richard.henderson,
	Philippe Mathieu-Daudé,
	ani, Cameron Esfahani via, qemu-arm


Cédric Le Goater <clg@kaod.org> writes:

> On 6/29/22 20:24, Alex Bennée wrote:
>> Cédric Le Goater <clg@kaod.org> writes:
>> 
>>> On 6/29/22 16:14, Alex Bennée wrote:
>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>
>>>>> On 6/24/22 18:50, Cédric Le Goater wrote:
>>>>>> On 6/23/22 20:43, Peter Delevoryas wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
>>>>>>>>
>>>>>>>> On 6/23/22 12:26, Peter Delevoryas wrote:
>>>>>>>>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>>>>>>>>
>>>>>>>> Let's start simple without flash support. We should be able to
>>>>>>>> load FW blobs in each CPU address space using loader devices.
>>>>>>>
>>>>>>> Actually, I was unable to do this, perhaps because the fb OpenBMC
>>>>>>> boot sequence is a little weird. I specifically _needed_ to have
>>>>>>> a flash device which maps the firmware in at 0x2000_0000, because
>>>>>>> the fb OpenBMC U-Boot SPL jumps to that address to start executing
>>>>>>> from flash? I think this is also why fb OpenBMC machines can be so slow.
>>>>>>>
>>>>>>> $ ./build/qemu-system-arm -machine fby35 \
>>>>>>>        -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
>>>>>>>        -d int -drive file=fby35.mtd,format=raw,if=mtd
>>>>>> Ideally we should be booting from the flash device directly using
>>>>>> the machine option '-M ast2600-evb,execute-in-place=true' like HW
>>>>>> does. Instructions are fetched using SPI transfers. But the amount
>>>>>> of code generated is tremendous.
>>>> Yeah because there is a potential race when reading from HW so we
>>>> throw
>>>> away TB's after executing them because we have no way of knowing if it
>>>> has changed under our feet. See 873d64ac30 (accel/tcg: re-factor non-RAM
>>>> execution code) which cleaned up this handling.
>>>>
>>>>>> See some profiling below for a
>>>>>> run which barely reaches DRAM training in U-Boot.
>>>>>
>>>>> Some more profiling on both ast2500 and ast2600 machines shows :
>>>>>
>>>>>
>>>>> * ast2600-evb,execute-in-place=true :
>>>>>
>>>>> Type               Object  Call site                Wait Time (s)         Count  Average (us)
>>>>> ---------------------------------------------------------------------------------------------
>>>>> BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1365       14.21443
>>>>> 32909927          0.43
>>>> This is unavoidable as a HW access needs the BQL held so we will go
>>>> through this cycle every executed instruction.
>>>> Did I miss why the flash contents are not mapped into the physical
>>>> address space? Isn't that how it appear to the processor?
>>>
>>>
>>> There are two modes :
>>>            if (ASPEED_MACHINE(machine)->mmio_exec) {
>>>              memory_region_init_alias(boot_rom, NULL, "aspeed.boot_rom",
>>>                                       &fl->mmio, 0, size);
>>>              memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>>>                                          boot_rom);
>>>          } else {
>>>              memory_region_init_rom(boot_rom, NULL, "aspeed.boot_rom",
>>>                                     size, &error_abort);
>>>              memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>>>                                          boot_rom);
>>>              write_boot_rom(drive0, FIRMWARE_ADDR, size, &error_abort);
>>>          }
>>>
>>> The default boot mode uses the ROM. No issue.
>>>
>>> The "execute-in-place=true" option creates an alias on the region of
>>> the flash contents and each instruction is then fetched from the flash
>>> drive with SPI transactions.

I wonder if we were keeping the code before?

>>>
>>> With old FW images, using an older U-boot, the machine boots in a couple
>>> of seconds. See the profiling below for a witherspoon-bmc machine using
>>> U-Boot 2016.07.
>>>
>>>    qemu-system-arm -M witherspoon-bmc,execute-in-place=true  -drive file=./flash-witherspoon-bmc,format=raw,if=mtd -drive file=./flash-witherspoon-bmc2,format=raw,if=mtd -nographic -nodefaults -snapshot -serial mon:stdio -enable-sync-profile
>>>    ...
>>>    U-Boot 2016.07-00040-g8425e96e2e27-dirty (Jun 24 2022 - 23:21:57 +0200)
>>>              Watchdog enabled
>>>    DRAM:  496 MiB
>>>    Flash: 32 MiB
>>>    In:    serial
>>>    Out:   serial
>>>    Err:   serial
>>>    Net:
>>>    (qemu) info sync-profile
>>>    Type               Object  Call site                Wait Time (s)         Count  Average (us)
>>>    ---------------------------------------------------------------------------------------------
>>>    BQL mutex  0x56189610b2e0  accel/tcg/cputlb.c:1365        0.25311      12346237          0.02
>>>    condvar    0x5618970cf220  softmmu/cpus.c:423             0.05506             2      27530.78
>>>    BQL mutex  0x56189610b2e0  util/rcu.c:269                 0.04709             2      23544.26
>>>    condvar    0x561896d0fc78  util/thread-pool.c:90          0.01340            83        161.47
>>>    condvar    0x56189610b240  softmmu/cpus.c:571             0.00005             1         54.93
>>>    condvar    0x56189610b280  softmmu/cpus.c:642             0.00003             1         32.88
>>>    BQL mutex  0x56189610b2e0  util/main-loop.c:318           0.00003            34          0.76
>>>    mutex      0x561896eade00  tcg/region.c:204               0.00002           995          0.02
>>>    rec_mutex  [           2]  util/async.c:682               0.00002           493          0.03
>>>    mutex      [           2]  chardev/char.c:118             0.00001           404          0.03
>>>    ---------------------------------------------------------------------------------------------
>>>
>>>
>>> However, with recent U-boots, it takes quite a while to reach DRAM training.
>>> Close to a minute. See the profiling below for an ast2500-evb machine using
>>> U-Boot 2019.04.
>>>
>>>   qemu-system-arm -M ast2500-evb,execute-in-place=true  -net nic,macaddr=C0:FF:EE:00:00:03,netdev=net0  -drive file=./flash-ast2500-evb,format=raw,if=mtd  -nographic -nodefaults -snapshot -serial mon:stdio  -enable-sync-profile
>>>    qemu-system-arm: warning: Aspeed iBT has no chardev backend
>>>    qemu-system-arm: warning: nic ftgmac100.1 has no peer
>>>    QEMU 7.0.50 monitor - type 'help' for more information
>>>       U-Boot 2019.04-00080-g6ca27db3f97b-dirty (Jun 24 2022 - 23:22:03
>>>      +0200)
>>>       SOC : AST2500-A1
>>>    RST : Power On
>>>    LPC Mode : SIO:Enable : SuperIO-2e
>>>    Eth : MAC0: RGMII, , MAC1: RGMII,
>>>    Model: AST2500 EVB
>>>    DRAM:  448 MiB (capacity:512 MiB, VGA:64 MiB, ECC:off)
>>>    MMC:   sdhci_slot0@100: 0, sdhci_slot1@200: 1
>>>    Loading Environment from SPI Flash... SF: Detected mx25l25635e with page size 256 Bytes, erase size 64 KiB, total 32 MiB
>>>    *** Warning - bad CRC, using default environment
>>>       In:    serial@1e784000
>>>    Out:   serial@1e784000
>>>    Err:   serial@1e784000
>>>    Net:   eth0: ethernet@1e660000
>>>    Warning: ethernet@1e680000 (eth1) using random MAC address - 4a:e5:9a:4a:c7:c5
>>>    , eth1: ethernet@1e680000
>>>    Hit any key to stop autoboot:  2
>>>    (qemu) info sync-profile
>>>    Type               Object  Call site                Wait Time (s)         Count  Average (us)
>>>    ---------------------------------------------------------------------------------------------
>>>    condvar    0x561f10c9ef88  util/thread-pool.c:90         10.01196            28     357570.00
>>>    BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1365        0.29496      14248621          0.02
>>>    condvar    0x561f110325a0  softmmu/cpus.c:423             0.02231             2      11152.57
>>>    BQL mutex  0x561f102362e0  util/rcu.c:269                 0.01447             4       3618.60
>>>    condvar    0x561f10236240  softmmu/cpus.c:571             0.00010             1        102.19
>>>    mutex      0x561f10e9f1c0  tcg/region.c:204               0.00007          3052          0.02
>>>    mutex      [           2]  chardev/char.c:118             0.00003          1486          0.02
>>>    condvar    0x561f10236280  softmmu/cpus.c:642             0.00003             1         29.38
>>>    BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1426        0.00002           973          0.02
>>>    BQL mutex  0x561f102362e0  util/main-loop.c:318           0.00001            34          0.41
>>>    ---------------------------------------------------------------------------------------------
>>>    Something in the layout of the FW is making a big difference. One
>>> that could be relevant is that the recent versions are using a device
>>> tree.
>>>
>>> There might be no good solution to this issue but I fail to analyze
>>> it correctly. Is there a way to collect information on the usage of
>>> Translation Blocks ?
>> You could expand the data we collect in tb_tree_stats and expose it
>> via
>> info jit.
>
> The "fast" run, U-Boot 2016.07, gives :
>
>
>   Translation buffer state:
>   gen code size       254880371/1073736704
>   TB count            1089
>   TB avg target size  16 max=356 bytes
>   TB avg host size    278 bytes (expansion ratio: 17.2)
>   cross page TB count 0 (0%)
>   direct jump count   501 (46%) (2 jumps=372 34%)
>   TB hash buckets     1025/8192 (12.51% head buckets used)
>   TB hash occupancy   3.32% avg chain occ. Histogram: [0.0,7.5)%|█  ▁  ▁  ▁|[67.5,75.0]%
>   TB hash avg chain   1.000 buckets. Histogram: 1|█|1
>      Statistics:
>   TB flush count      0
>   TB invalidate count 0
>   TLB full flushes    0
>   TLB partial flushes 2
>   TLB elided flushes  2338
>   JIT cycles          2221788409 (0.926 s at 2.4 GHz)
>   translated TBs      738520 (aborted=0 0.0%)
>   avg ops/TB          15.7 max=459
>   deleted ops/TB      2.72
>   avg temps/TB        32.89 max=88
>   avg host code/TB    113.7
>   avg search data/TB  5.2
>   cycles/op           192.0
>   cycles/in byte      748.7
>   cycles/out byte     26.4
>   cycles/search byte     582.8
>     gen_interm time   57.6%
>     gen_code time     42.4%
>   optim./code time    19.4%
>   liveness/code time  26.1%
>   cpu_restore count   0
>     avg cycles        0.0
>   and the "slow", U-Boot 2019.04 :
>
> Translation buffer state:
> gen code size       368603795/1073736704
> TB count            3052
> TB avg target size  16 max=360 bytes
> TB avg host size    293 bytes (expansion ratio: 17.6)
> cross page TB count 0 (0%)
> direct jump count   1431 (46%) (2 jumps=1104 36%)
> TB hash buckets     2559/8192 (31.24% head buckets used)
> TB hash occupancy   9.31% avg chain occ. Histogram: [0,10)%|█ ▃  ▁ ▁ ▁|[90,100]%
> TB hash avg chain   1.000 buckets. Histogram: 1|█|1
>
> Statistics:
> TB flush count      3

Something triggers 3 complete TB flushes in the slow run (3x1000 ~ 3000
TBs). This can be the TB overflowing but:

fast: gen code size       254880371/1073736704
slow: gen code size       368603795/1073736704

doesn't look like we are getting close to overflowing. Must be something
else.

> TB invalidate count 0
> TLB full flushes    0
> TLB partial flushes 3
> TLB elided flushes  2367
> JIT cycles          26479044772 (11.033 s at 2.4 GHz)
> translated TBs      10552169 (aborted=0 0.0%)
> avg ops/TB          15.0 max=464
> deleted ops/TB      2.44
> avg temps/TB        32.43 max=89
> avg host code/TB    99.0
> avg search data/TB  5.0
> cycles/op           167.7
> cycles/in byte      626.8
> cycles/out byte     25.4
> cycles/search byte     499.4
>   gen_interm time   50.4%
>   gen_code time     49.6%
> optim./code time    19.5%
> liveness/code time  27.7%
> cpu_restore count   0
>   avg cycles        0.0
>
>
> A lot more TBs.
>   C.


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
  2022-06-30  9:43                   ` Alex Bennée
@ 2022-07-05 12:35                     ` Cédric Le Goater
  0 siblings, 0 replies; 19+ messages in thread
From: Cédric Le Goater @ 2022-07-05 12:35 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Peter Delevoryas, Peter Maydell, Andrew Jeffery, Joel Stanley,
	pbonzini, berrange, eduardo, marcel.apfelbaum, richard.henderson,
	Philippe Mathieu-Daudé,
	ani, Cameron Esfahani via, qemu-arm

Hello Alex,

On 6/30/22 11:43, Alex Bennée wrote:
> 
> Cédric Le Goater <clg@kaod.org> writes:
> 
>> On 6/29/22 20:24, Alex Bennée wrote:
>>> Cédric Le Goater <clg@kaod.org> writes:
>>>
>>>> On 6/29/22 16:14, Alex Bennée wrote:
>>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>>
>>>>>> On 6/24/22 18:50, Cédric Le Goater wrote:
>>>>>>> On 6/23/22 20:43, Peter Delevoryas wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
>>>>>>>>>
>>>>>>>>> On 6/23/22 12:26, Peter Delevoryas wrote:
>>>>>>>>>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>>>>>>>>>
>>>>>>>>> Let's start simple without flash support. We should be able to
>>>>>>>>> load FW blobs in each CPU address space using loader devices.
>>>>>>>>
>>>>>>>> Actually, I was unable to do this, perhaps because the fb OpenBMC
>>>>>>>> boot sequence is a little weird. I specifically _needed_ to have
>>>>>>>> a flash device which maps the firmware in at 0x2000_0000, because
>>>>>>>> the fb OpenBMC U-Boot SPL jumps to that address to start executing
>>>>>>>> from flash? I think this is also why fb OpenBMC machines can be so slow.
>>>>>>>>
>>>>>>>> $ ./build/qemu-system-arm -machine fby35 \
>>>>>>>>         -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
>>>>>>>>         -d int -drive file=fby35.mtd,format=raw,if=mtd
>>>>>>> Ideally we should be booting from the flash device directly using
>>>>>>> the machine option '-M ast2600-evb,execute-in-place=true' like HW
>>>>>>> does. Instructions are fetched using SPI transfers. But the amount
>>>>>>> of code generated is tremendous.
>>>>> Yeah because there is a potential race when reading from HW so we
>>>>> throw
>>>>> away TB's after executing them because we have no way of knowing if it
>>>>> has changed under our feet. See 873d64ac30 (accel/tcg: re-factor non-RAM
>>>>> execution code) which cleaned up this handling.
>>>>>
>>>>>>> See some profiling below for a
>>>>>>> run which barely reaches DRAM training in U-Boot.
>>>>>>
>>>>>> Some more profiling on both ast2500 and ast2600 machines shows :
>>>>>>
>>>>>>
>>>>>> * ast2600-evb,execute-in-place=true :
>>>>>>
>>>>>> Type               Object  Call site                Wait Time (s)         Count  Average (us)
>>>>>> ---------------------------------------------------------------------------------------------
>>>>>> BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1365       14.21443
>>>>>> 32909927          0.43
>>>>> This is unavoidable as a HW access needs the BQL held so we will go
>>>>> through this cycle every executed instruction.
>>>>> Did I miss why the flash contents are not mapped into the physical
>>>>> address space? Isn't that how it appear to the processor?
>>>>
>>>>
>>>> There are two modes :
>>>>             if (ASPEED_MACHINE(machine)->mmio_exec) {
>>>>               memory_region_init_alias(boot_rom, NULL, "aspeed.boot_rom",
>>>>                                        &fl->mmio, 0, size);
>>>>               memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>>>>                                           boot_rom);
>>>>           } else {
>>>>               memory_region_init_rom(boot_rom, NULL, "aspeed.boot_rom",
>>>>                                      size, &error_abort);
>>>>               memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>>>>                                           boot_rom);
>>>>               write_boot_rom(drive0, FIRMWARE_ADDR, size, &error_abort);
>>>>           }
>>>>
>>>> The default boot mode uses the ROM. No issue.
>>>>
>>>> The "execute-in-place=true" option creates an alias on the region of
>>>> the flash contents and each instruction is then fetched from the flash
>>>> drive with SPI transactions.
> 
> I wonder if we were keeping the code before?
> 
>>>>
>>>> With old FW images, using an older U-boot, the machine boots in a couple
>>>> of seconds. See the profiling below for a witherspoon-bmc machine using
>>>> U-Boot 2016.07.
>>>>
>>>>     qemu-system-arm -M witherspoon-bmc,execute-in-place=true  -drive file=./flash-witherspoon-bmc,format=raw,if=mtd -drive file=./flash-witherspoon-bmc2,format=raw,if=mtd -nographic -nodefaults -snapshot -serial mon:stdio -enable-sync-profile
>>>>     ...
>>>>     U-Boot 2016.07-00040-g8425e96e2e27-dirty (Jun 24 2022 - 23:21:57 +0200)
>>>>               Watchdog enabled
>>>>     DRAM:  496 MiB
>>>>     Flash: 32 MiB
>>>>     In:    serial
>>>>     Out:   serial
>>>>     Err:   serial
>>>>     Net:
>>>>     (qemu) info sync-profile
>>>>     Type               Object  Call site                Wait Time (s)         Count  Average (us)
>>>>     ---------------------------------------------------------------------------------------------
>>>>     BQL mutex  0x56189610b2e0  accel/tcg/cputlb.c:1365        0.25311      12346237          0.02
>>>>     condvar    0x5618970cf220  softmmu/cpus.c:423             0.05506             2      27530.78
>>>>     BQL mutex  0x56189610b2e0  util/rcu.c:269                 0.04709             2      23544.26
>>>>     condvar    0x561896d0fc78  util/thread-pool.c:90          0.01340            83        161.47
>>>>     condvar    0x56189610b240  softmmu/cpus.c:571             0.00005             1         54.93
>>>>     condvar    0x56189610b280  softmmu/cpus.c:642             0.00003             1         32.88
>>>>     BQL mutex  0x56189610b2e0  util/main-loop.c:318           0.00003            34          0.76
>>>>     mutex      0x561896eade00  tcg/region.c:204               0.00002           995          0.02
>>>>     rec_mutex  [           2]  util/async.c:682               0.00002           493          0.03
>>>>     mutex      [           2]  chardev/char.c:118             0.00001           404          0.03
>>>>     ---------------------------------------------------------------------------------------------
>>>>
>>>>
>>>> However, with recent U-boots, it takes quite a while to reach DRAM training.
>>>> Close to a minute. See the profiling below for an ast2500-evb machine using
>>>> U-Boot 2019.04.
>>>>
>>>>    qemu-system-arm -M ast2500-evb,execute-in-place=true  -net nic,macaddr=C0:FF:EE:00:00:03,netdev=net0  -drive file=./flash-ast2500-evb,format=raw,if=mtd  -nographic -nodefaults -snapshot -serial mon:stdio  -enable-sync-profile
>>>>     qemu-system-arm: warning: Aspeed iBT has no chardev backend
>>>>     qemu-system-arm: warning: nic ftgmac100.1 has no peer
>>>>     QEMU 7.0.50 monitor - type 'help' for more information
>>>>        U-Boot 2019.04-00080-g6ca27db3f97b-dirty (Jun 24 2022 - 23:22:03
>>>>       +0200)
>>>>        SOC : AST2500-A1
>>>>     RST : Power On
>>>>     LPC Mode : SIO:Enable : SuperIO-2e
>>>>     Eth : MAC0: RGMII, , MAC1: RGMII,
>>>>     Model: AST2500 EVB
>>>>     DRAM:  448 MiB (capacity:512 MiB, VGA:64 MiB, ECC:off)
>>>>     MMC:   sdhci_slot0@100: 0, sdhci_slot1@200: 1
>>>>     Loading Environment from SPI Flash... SF: Detected mx25l25635e with page size 256 Bytes, erase size 64 KiB, total 32 MiB
>>>>     *** Warning - bad CRC, using default environment
>>>>        In:    serial@1e784000
>>>>     Out:   serial@1e784000
>>>>     Err:   serial@1e784000
>>>>     Net:   eth0: ethernet@1e660000
>>>>     Warning: ethernet@1e680000 (eth1) using random MAC address - 4a:e5:9a:4a:c7:c5
>>>>     , eth1: ethernet@1e680000
>>>>     Hit any key to stop autoboot:  2
>>>>     (qemu) info sync-profile
>>>>     Type               Object  Call site                Wait Time (s)         Count  Average (us)
>>>>     ---------------------------------------------------------------------------------------------
>>>>     condvar    0x561f10c9ef88  util/thread-pool.c:90         10.01196            28     357570.00
>>>>     BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1365        0.29496      14248621          0.02
>>>>     condvar    0x561f110325a0  softmmu/cpus.c:423             0.02231             2      11152.57
>>>>     BQL mutex  0x561f102362e0  util/rcu.c:269                 0.01447             4       3618.60
>>>>     condvar    0x561f10236240  softmmu/cpus.c:571             0.00010             1        102.19
>>>>     mutex      0x561f10e9f1c0  tcg/region.c:204               0.00007          3052          0.02
>>>>     mutex      [           2]  chardev/char.c:118             0.00003          1486          0.02
>>>>     condvar    0x561f10236280  softmmu/cpus.c:642             0.00003             1         29.38
>>>>     BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1426        0.00002           973          0.02
>>>>     BQL mutex  0x561f102362e0  util/main-loop.c:318           0.00001            34          0.41
>>>>     ---------------------------------------------------------------------------------------------
>>>>     Something in the layout of the FW is making a big difference. One
>>>> that could be relevant is that the recent versions are using a device
>>>> tree.
>>>>
>>>> There might be no good solution to this issue but I fail to analyze
>>>> it correctly. Is there a way to collect information on the usage of
>>>> Translation Blocks ?
>>> You could expand the data we collect in tb_tree_stats and expose it
>>> via
>>> info jit.
>>
>> The "fast" run, U-Boot 2016.07, gives :
>>
>>
>>    Translation buffer state:
>>    gen code size       254880371/1073736704
>>    TB count            1089
>>    TB avg target size  16 max=356 bytes
>>    TB avg host size    278 bytes (expansion ratio: 17.2)
>>    cross page TB count 0 (0%)
>>    direct jump count   501 (46%) (2 jumps=372 34%)
>>    TB hash buckets     1025/8192 (12.51% head buckets used)
>>    TB hash occupancy   3.32% avg chain occ. Histogram: [0.0,7.5)%|█  ▁  ▁  ▁|[67.5,75.0]%
>>    TB hash avg chain   1.000 buckets. Histogram: 1|█|1
>>       Statistics:
>>    TB flush count      0
>>    TB invalidate count 0
>>    TLB full flushes    0
>>    TLB partial flushes 2
>>    TLB elided flushes  2338
>>    JIT cycles          2221788409 (0.926 s at 2.4 GHz)
>>    translated TBs      738520 (aborted=0 0.0%)
>>    avg ops/TB          15.7 max=459
>>    deleted ops/TB      2.72
>>    avg temps/TB        32.89 max=88
>>    avg host code/TB    113.7
>>    avg search data/TB  5.2
>>    cycles/op           192.0
>>    cycles/in byte      748.7
>>    cycles/out byte     26.4
>>    cycles/search byte     582.8
>>      gen_interm time   57.6%
>>      gen_code time     42.4%
>>    optim./code time    19.4%
>>    liveness/code time  26.1%
>>    cpu_restore count   0
>>      avg cycles        0.0
>>    and the "slow", U-Boot 2019.04 :
>>
>> Translation buffer state:
>> gen code size       368603795/1073736704
>> TB count            3052
>> TB avg target size  16 max=360 bytes
>> TB avg host size    293 bytes (expansion ratio: 17.6)
>> cross page TB count 0 (0%)
>> direct jump count   1431 (46%) (2 jumps=1104 36%)
>> TB hash buckets     2559/8192 (31.24% head buckets used)
>> TB hash occupancy   9.31% avg chain occ. Histogram: [0,10)%|█ ▃  ▁ ▁ ▁|[90,100]%
>> TB hash avg chain   1.000 buckets. Histogram: 1|█|1
>>
>> Statistics:
>> TB flush count      3
> 
> Something triggers 3 complete TB flushes in the slow run (3x1000 ~ 3000
> TBs). This can be the TB overflowing but:
> 
> fast: gen code size       254880371/1073736704
> slow: gen code size       368603795/1073736704
> 
> doesn't look like we are getting close to overflowing. Must be something
> else.


Here are images to reproduce. These machines use the same AST2500 SoC.

* U-Boot 2016.07

   wget https://github.com/openbmc/openbmc/releases/download/2.9.0/obmc-phosphor-image-romulus.static.mtd

   qemu-system-arm -M romulus-bmc -drive file=./obmc-phosphor-image-romulus.static.mtd,format=raw,if=mtd -nographic
   qemu-system-arm -M romulus-bmc,execute-in-place=true -drive file=./obmc-phosphor-image-romulus.static.mtd,format=raw,if=mtd -nographic

* U-Boot 2019.04

   https://github.com/legoater/qemu-aspeed-boot/raw/master/images/ast2500-evb/buildroot-2022.05/flash.img

   qemu-system-arm -M ast2500-evb -drive file=./flash.img,format=raw,if=mtd -nographic
   qemu-system-arm -M ast2500-evb,execute-in-place=true -drive file=./flash.img,format=raw,if=mtd -nographic


Thanks,

C.


> 
>> TB invalidate count 0
>> TLB full flushes    0
>> TLB partial flushes 3
>> TLB elided flushes  2367
>> JIT cycles          26479044772 (11.033 s at 2.4 GHz)
>> translated TBs      10552169 (aborted=0 0.0%)
>> avg ops/TB          15.0 max=464
>> deleted ops/TB      2.44
>> avg temps/TB        32.43 max=89
>> avg host code/TB    99.0
>> avg search data/TB  5.0
>> cycles/op           167.7
>> cycles/in byte      626.8
>> cycles/out byte     25.4
>> cycles/search byte     499.4
>>    gen_interm time   50.4%
>>    gen_code time     49.6%
>> optim./code time    19.5%
>> liveness/code time  27.7%
>> cpu_restore count   0
>>    avg cycles        0.0
>>
>>
>> A lot more TBs.
>>    C.
> 
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-07-05 12:39 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20220623102617.2164175-1-pdel@fb.com>
     [not found] ` <20220623102617.2164175-3-pdel@fb.com>
2022-06-23 12:11   ` [PATCH 02/14] sysbus: Remove sysbus_address_space Peter Maydell
     [not found] ` <20220623102617.2164175-5-pdel@fb.com>
2022-06-23 12:15   ` [PATCH 04/14] sysbus: Add sysbus_mmio_map_in Peter Maydell
2022-06-23 18:29     ` Peter Delevoryas
2022-06-23 18:29       ` Peter Delevoryas
     [not found] ` <20220623102617.2164175-9-pdel@fb.com>
2022-06-23 12:57   ` [PATCH 08/14] aspeed: Replace direct get_system_memory() calls Peter Maydell
2022-06-23 15:39     ` Cédric Le Goater
2022-06-23 18:45       ` Peter Delevoryas
2022-06-23 18:45         ` Peter Delevoryas
     [not found] ` <20220623102617.2164175-13-pdel@fb.com>
2022-06-23 15:09   ` [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public Cédric Le Goater
2022-06-23 18:43     ` Peter Delevoryas
2022-06-23 18:43       ` Peter Delevoryas
2022-06-24 16:50       ` Cédric Le Goater
2022-06-29  9:11         ` Cédric Le Goater
2022-06-29 14:14           ` Alex Bennée
2022-06-29 15:54             ` Cédric Le Goater
2022-06-29 18:24               ` Alex Bennée
2022-06-30  8:49                 ` Cédric Le Goater
2022-06-30  9:43                   ` Alex Bennée
2022-07-05 12:35                     ` Cédric Le Goater

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.