All of lore.kernel.org
 help / color / mirror / Atom feed
* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
@ 2015-01-28 12:55 Przemyslaw Marczak
  2015-01-28 12:55 ` [U-Boot] [PATCH 1/3] exynos: config: enable arch memcpy and arch memset Przemyslaw Marczak
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Przemyslaw Marczak @ 2015-01-28 12:55 UTC (permalink / raw)
  To: u-boot

This patchset reduces the boot time for ARM architecture,
Exynos boards, and boards with DFU enabled(ARM).

For tested Trats2 device, this was done in three steps.

First was enable the arch memcpy and memset.
The second step was enable memset for .bss clear.
The third step for reduce this operation is to keep .bss section
small as possible.

The .bss section will grow if we have a lot of static variables.
This section is cleared before jump to the relocated U-Boot,
and it's done word by word. To reduce the time for this step,
we can enable arch memset, which uses multiple ARM registers.

For configs with DFU enabled, we can find the dfu buffer in this section,
which has at least 8MB (32MB for trats2). This is a lot of useless data,
which is not required for standard boot. So this buffer should be dynamic
allocated.

Przemyslaw Marczak (3):
  exynos: config: enable arch memcpy and arch memset
  arm: relocation: clear .bss section with arch memset if defined
  dfu: mmc: file buffer: remove static allocation

 arch/arm/lib/crt0.S             | 10 +++++++++-
 drivers/dfu/dfu_mmc.c           | 25 ++++++++++++++++++++++---
 include/configs/exynos-common.h |  3 +++
 3 files changed, 34 insertions(+), 4 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 1/3] exynos: config: enable arch memcpy and arch memset
  2015-01-28 12:55 [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time Przemyslaw Marczak
@ 2015-01-28 12:55 ` Przemyslaw Marczak
  2015-01-28 12:55 ` [U-Boot] [PATCH 2/3] arm: relocation: clear .bss section with arch memset if defined Przemyslaw Marczak
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Przemyslaw Marczak @ 2015-01-28 12:55 UTC (permalink / raw)
  To: u-boot

This commit enables the following configs:
- CONFIG_USE_ARCH_MEMCPY
- CONFIG_USE_ARCH_MEMSET
This increases the performance of memcpy/memset
and also reduces the boot time.

This was tested on Trats2.
A quick test with trace. Boot time from start to main_loop() entry:
- ~1527ms - before this change (arch memset enabled for .bss clear)
- ~1384ms - after this change

Signed-off-by: Przemyslaw Marczak <p.marczak@samsung.com>
Cc: Minkyu Kang <mk7.kang@samsung.com>
Cc: Akshay Saraswat <akshay.s@samsung.com>
Cc: Simon Glass <sjg@chromium.org>
Cc: Sjoerd Simons <sjoerd.simons@collabora.co.uk>
---
 include/configs/exynos-common.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/configs/exynos-common.h b/include/configs/exynos-common.h
index 1f3ee55..5c14c40 100644
--- a/include/configs/exynos-common.h
+++ b/include/configs/exynos-common.h
@@ -30,6 +30,9 @@
 #define CONFIG_SKIP_LOWLEVEL_INIT
 #define CONFIG_BOARD_EARLY_INIT_F
 
+#define CONFIG_USE_ARCH_MEMCPY
+#define CONFIG_USE_ARCH_MEMSET
+
 /* Keep L2 Cache Disabled */
 #define CONFIG_CMD_CACHE
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 2/3] arm: relocation: clear .bss section with arch memset if defined
  2015-01-28 12:55 [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time Przemyslaw Marczak
  2015-01-28 12:55 ` [U-Boot] [PATCH 1/3] exynos: config: enable arch memcpy and arch memset Przemyslaw Marczak
@ 2015-01-28 12:55 ` Przemyslaw Marczak
  2015-02-01  2:38   ` Albert ARIBAUD
  2015-01-28 12:55 ` [U-Boot] [PATCH 3/3] dfu: mmc: file buffer: remove static allocation Przemyslaw Marczak
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Przemyslaw Marczak @ 2015-01-28 12:55 UTC (permalink / raw)
  To: u-boot

For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY,
will highly increase the memset/memcpy performance. This is able
thanks to the ARM multiple register instructions.

Unfortunatelly the relocation is done without the cache enabled,
so it takes some time, but zeroing the BSS memory takes much more
longer, especially for the configs with big static buffers.

A quick test confirms, that the boot time improvement after using
the arch memcpy for relocation has no significant meaning.
The same test confirms that enable the memset for zeroing BSS,
reduces the boot time.

So this patch enables the arch memset for zeroing the BSS after
the relocation process. For ARM boards, this can be enabled
in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.

This was tested on Trats2.
A quick test with trace. Boot time from start to main_loop() entry:
- ~1384ms - before this change
-  ~888ms - after this change

Signed-off-by: Przemyslaw Marczak <p.marczak@samsung.com>
Cc: Albert Aribaud <albert.u.boot@aribaud.net>
Cc: Tom Rini <trini@ti.com>
---
 arch/arm/lib/crt0.S | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/arm/lib/crt0.S b/arch/arm/lib/crt0.S
index 22df3e5..fab3d2c 100644
--- a/arch/arm/lib/crt0.S
+++ b/arch/arm/lib/crt0.S
@@ -115,14 +115,22 @@ here:
 	bl	c_runtime_cpu_setup	/* we still call old routine here */
 
 	ldr	r0, =__bss_start	/* this is auto-relocated! */
-	ldr	r1, =__bss_end		/* this is auto-relocated! */
 
+#ifdef CONFIG_USE_ARCH_MEMSET
+	ldr	r3, =__bss_end		/* this is auto-relocated! */
+	mov	r1, #0x00000000		/* prepare zero to clear BSS */
+
+	subs	r2, r3, r0		/* r2 = memset len */
+	bl	memset
+#else
+	ldr	r1, =__bss_end		/* this is auto-relocated! */
 	mov	r2, #0x00000000		/* prepare zero to clear BSS */
 
 clbss_l:cmp	r0, r1			/* while not at end of BSS */
 	strlo	r2, [r0]		/* clear 32-bit BSS word */
 	addlo	r0, r0, #4		/* move to next */
 	blo	clbss_l
+#endif
 
 	bl coloured_LED_init
 	bl red_led_on
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 3/3] dfu: mmc: file buffer: remove static allocation
  2015-01-28 12:55 [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time Przemyslaw Marczak
  2015-01-28 12:55 ` [U-Boot] [PATCH 1/3] exynos: config: enable arch memcpy and arch memset Przemyslaw Marczak
  2015-01-28 12:55 ` [U-Boot] [PATCH 2/3] arm: relocation: clear .bss section with arch memset if defined Przemyslaw Marczak
@ 2015-01-28 12:55 ` Przemyslaw Marczak
  2015-01-28 13:12 ` [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time Stefan Roese
  2015-01-29 16:48 ` Przemyslaw Marczak
  4 siblings, 0 replies; 25+ messages in thread
From: Przemyslaw Marczak @ 2015-01-28 12:55 UTC (permalink / raw)
  To: u-boot

For writing files, DFU implementation requires the file buffer
with the len at least of file size. For big files it requires
the same big buffer.

Previously the file buffer was allocated as a static variable,
so it was a part of U-Boot .bss section. For 32MiB len of buffer
we have 32MiB of additional space, required for this section.

The .bss needs to be cleared after the relocation.
This introduces an additional boot delay at every start, but usually
the dfu feature is not required at the standard boot, so the buffer
should be allocated only if required.

This patch removes the static allocation of this buffer,
and alloc it with memalign after first call of function:
- dfu_fill_entity_mmc()
and the buffer is freed on dfu_free_entity() call.

This was tested on Trats2.
A quick test with trace. Boot time from start to main_loop() entry:
- ~888ms - before this change (arch memset enabled for .bss clear)
- ~464ms - after this change

Signed-off-by: Przemyslaw Marczak <p.marczak@samsung.com>
Cc: Lukasz Majewski <l.majewski@samsung.com>
Cc: Stephen Warren <swarren@nvidia.com>
Cc: Pantelis Antoniou <panto@antoniou-consulting.com>
Cc: Tom Rini <trini@ti.com>
Cc: Marek Vasut <marek.vasut@gmail.com>
---
 drivers/dfu/dfu_mmc.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/dfu/dfu_mmc.c b/drivers/dfu/dfu_mmc.c
index 62d72fe..fd865e1 100644
--- a/drivers/dfu/dfu_mmc.c
+++ b/drivers/dfu/dfu_mmc.c
@@ -16,8 +16,7 @@
 #include <fat.h>
 #include <mmc.h>
 
-static unsigned char __aligned(CONFIG_SYS_CACHELINE_SIZE)
-				dfu_file_buf[CONFIG_SYS_DFU_MAX_FILE_SIZE];
+static unsigned char *dfu_file_buf;
 static long dfu_file_buf_len;
 
 static int mmc_access_part(struct dfu_entity *dfu, struct mmc *mmc, int part)
@@ -211,7 +210,7 @@ int dfu_flush_medium_mmc(struct dfu_entity *dfu)
 
 	if (dfu->layout != DFU_RAW_ADDR) {
 		/* Do stuff here. */
-		ret = mmc_file_op(DFU_OP_WRITE, dfu, &dfu_file_buf,
+		ret = mmc_file_op(DFU_OP_WRITE, dfu, dfu_file_buf,
 				&dfu_file_buf_len);
 
 		/* Now that we're done */
@@ -263,6 +262,14 @@ int dfu_read_medium_mmc(struct dfu_entity *dfu, u64 offset, void *buf,
 	return ret;
 }
 
+void dfu_free_entity_mmc(struct dfu_entity *dfu)
+{
+	if (dfu_file_buf) {
+		free(dfu_file_buf);
+		dfu_file_buf = NULL;
+	}
+}
+
 /*
  * @param s Parameter string containing space-separated arguments:
  *	1st:
@@ -370,6 +377,18 @@ int dfu_fill_entity_mmc(struct dfu_entity *dfu, char *devstr, char *s)
 	dfu->write_medium = dfu_write_medium_mmc;
 	dfu->flush_medium = dfu_flush_medium_mmc;
 	dfu->inited = 0;
+	dfu->free_entity = dfu_free_entity_mmc;
+
+	/* Check if file buffer is ready */
+	if (!dfu_file_buf) {
+		dfu_file_buf = memalign(CONFIG_SYS_CACHELINE_SIZE,
+					CONFIG_SYS_DFU_MAX_FILE_SIZE);
+		if (!dfu_file_buf) {
+			error("Could not memalign 0x%x bytes",
+			      CONFIG_SYS_DFU_MAX_FILE_SIZE);
+			return -ENOMEM;
+		}
+	}
 
 	return 0;
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
  2015-01-28 12:55 [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time Przemyslaw Marczak
                   ` (2 preceding siblings ...)
  2015-01-28 12:55 ` [U-Boot] [PATCH 3/3] dfu: mmc: file buffer: remove static allocation Przemyslaw Marczak
@ 2015-01-28 13:12 ` Stefan Roese
  2015-01-28 14:10   ` Przemyslaw Marczak
  2015-01-29 16:48 ` Przemyslaw Marczak
  4 siblings, 1 reply; 25+ messages in thread
From: Stefan Roese @ 2015-01-28 13:12 UTC (permalink / raw)
  To: u-boot

Hi Przemyslaw,

On 28.01.2015 13:55, Przemyslaw Marczak wrote:
> This patchset reduces the boot time for ARM architecture,
> Exynos boards, and boards with DFU enabled(ARM).
>
> For tested Trats2 device, this was done in three steps.
>
> First was enable the arch memcpy and memset.
> The second step was enable memset for .bss clear.
> The third step for reduce this operation is to keep .bss section
> small as possible.
>
> The .bss section will grow if we have a lot of static variables.
> This section is cleared before jump to the relocated U-Boot,
> and it's done word by word. To reduce the time for this step,
> we can enable arch memset, which uses multiple ARM registers.
>
> For configs with DFU enabled, we can find the dfu buffer in this section,
> which has at least 8MB (32MB for trats2). This is a lot of useless data,
> which is not required for standard boot. So this buffer should be dynamic
> allocated.
>
> Przemyslaw Marczak (3):
>    exynos: config: enable arch memcpy and arch memset
>    arm: relocation: clear .bss section with arch memset if defined
>    dfu: mmc: file buffer: remove static allocation
>
>   arch/arm/lib/crt0.S             | 10 +++++++++-
>   drivers/dfu/dfu_mmc.c           | 25 ++++++++++++++++++++++---
>   include/configs/exynos-common.h |  3 +++
>   3 files changed, 34 insertions(+), 4 deletions(-)

Looking at the commit messages of this patchset I can conclude that your 
overall boot time reduction is:

from ~1527ms
to ~464ms

This is amazing! Congrats. :)

We really should in general make more use of the optimized functions and 
take care that the buffers (e.g. the DFU buffer in this case) are used 
in a sane way.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
  2015-01-28 13:12 ` [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time Stefan Roese
@ 2015-01-28 14:10   ` Przemyslaw Marczak
  2015-01-28 14:18     ` Pantelis Antoniou
  0 siblings, 1 reply; 25+ messages in thread
From: Przemyslaw Marczak @ 2015-01-28 14:10 UTC (permalink / raw)
  To: u-boot

Hello Stefan,

On 01/28/2015 02:12 PM, Stefan Roese wrote:
> Hi Przemyslaw,
>
> On 28.01.2015 13:55, Przemyslaw Marczak wrote:
>> This patchset reduces the boot time for ARM architecture,
>> Exynos boards, and boards with DFU enabled(ARM).
>>
>> For tested Trats2 device, this was done in three steps.
>>
>> First was enable the arch memcpy and memset.
>> The second step was enable memset for .bss clear.
>> The third step for reduce this operation is to keep .bss section
>> small as possible.
>>
>> The .bss section will grow if we have a lot of static variables.
>> This section is cleared before jump to the relocated U-Boot,
>> and it's done word by word. To reduce the time for this step,
>> we can enable arch memset, which uses multiple ARM registers.
>>
>> For configs with DFU enabled, we can find the dfu buffer in this section,
>> which has at least 8MB (32MB for trats2). This is a lot of useless data,
>> which is not required for standard boot. So this buffer should be dynamic
>> allocated.
>>
>> Przemyslaw Marczak (3):
>>    exynos: config: enable arch memcpy and arch memset
>>    arm: relocation: clear .bss section with arch memset if defined
>>    dfu: mmc: file buffer: remove static allocation
>>
>>   arch/arm/lib/crt0.S             | 10 +++++++++-
>>   drivers/dfu/dfu_mmc.c           | 25 ++++++++++++++++++++++---
>>   include/configs/exynos-common.h |  3 +++
>>   3 files changed, 34 insertions(+), 4 deletions(-)
>
> Looking at the commit messages of this patchset I can conclude that your
> overall boot time reduction is:
>
> from ~1527ms
> to ~464ms
>
> This is amazing! Congrats. :)
>

Thank you. I was also amazed.

The time results are taken with from the clock cycle counter, I think 
it's reliable.  Some day I would like to check it using the oscilloscope.

> We really should in general make more use of the optimized functions and
> take care that the buffers (e.g. the DFU buffer in this case) are used
> in a sane way.
>
> Thanks,
> Stefan
>
>

Yes you're right, I thought that Exynos config has enabled arch 
memcpy/set lib, before I checked this...

Best regards,
-- 
Przemyslaw Marczak
Samsung R&D Institute Poland
Samsung Electronics
p.marczak at samsung.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
  2015-01-28 14:10   ` Przemyslaw Marczak
@ 2015-01-28 14:18     ` Pantelis Antoniou
  2015-01-28 14:30       ` Przemyslaw Marczak
  0 siblings, 1 reply; 25+ messages in thread
From: Pantelis Antoniou @ 2015-01-28 14:18 UTC (permalink / raw)
  To: u-boot

Hi Przemyslaw,

> On Jan 28, 2015, at 16:10 , Przemyslaw Marczak <p.marczak@samsung.com> wrote:
> 
> Hello Stefan,
> 
> On 01/28/2015 02:12 PM, Stefan Roese wrote:
>> Hi Przemyslaw,
>> 
>> On 28.01.2015 13:55, Przemyslaw Marczak wrote:
>>> This patchset reduces the boot time for ARM architecture,
>>> Exynos boards, and boards with DFU enabled(ARM).
>>> 
>>> For tested Trats2 device, this was done in three steps.
>>> 
>>> First was enable the arch memcpy and memset.
>>> The second step was enable memset for .bss clear.
>>> The third step for reduce this operation is to keep .bss section
>>> small as possible.
>>> 
>>> The .bss section will grow if we have a lot of static variables.
>>> This section is cleared before jump to the relocated U-Boot,
>>> and it's done word by word. To reduce the time for this step,
>>> we can enable arch memset, which uses multiple ARM registers.
>>> 
>>> For configs with DFU enabled, we can find the dfu buffer in this section,
>>> which has at least 8MB (32MB for trats2). This is a lot of useless data,
>>> which is not required for standard boot. So this buffer should be dynamic
>>> allocated.
>>> 
>>> Przemyslaw Marczak (3):
>>>   exynos: config: enable arch memcpy and arch memset
>>>   arm: relocation: clear .bss section with arch memset if defined
>>>   dfu: mmc: file buffer: remove static allocation
>>> 
>>>  arch/arm/lib/crt0.S             | 10 +++++++++-
>>>  drivers/dfu/dfu_mmc.c           | 25 ++++++++++++++++++++++---
>>>  include/configs/exynos-common.h |  3 +++
>>>  3 files changed, 34 insertions(+), 4 deletions(-)
>> 
>> Looking at the commit messages of this patchset I can conclude that your
>> overall boot time reduction is:
>> 
>> from ~1527ms
>> to ~464ms
>> 
>> This is amazing! Congrats. :)
>> 
> 
> Thank you. I was also amazed.
> 
> The time results are taken with from the clock cycle counter, I think it's reliable. Some day I would like to check it using the oscilloscope.
> 
>> We really should in general make more use of the optimized functions and
>> take care that the buffers (e.g. the DFU buffer in this case) are used
>> in a sane way.
>> 
>> Thanks,
>> Stefan
>> 
>> 
> 
> Yes you're right, I thought that Exynos config has enabled arch memcpy/set lib, before I checked this?
> 

Those numbers are indeed incredible; I suppose the caches are disabled?


> Best regards,
> -- 
> Przemyslaw Marczak
> Samsung R&D Institute Poland
> Samsung Electronics
> p.marczak at samsung.com

Regards

? Pantelis

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
  2015-01-28 14:18     ` Pantelis Antoniou
@ 2015-01-28 14:30       ` Przemyslaw Marczak
  2015-01-28 14:34         ` Pantelis Antoniou
  0 siblings, 1 reply; 25+ messages in thread
From: Przemyslaw Marczak @ 2015-01-28 14:30 UTC (permalink / raw)
  To: u-boot

Hello,

On 01/28/2015 03:18 PM, Pantelis Antoniou wrote:
> Hi Przemyslaw,
>
>> On Jan 28, 2015, at 16:10 , Przemyslaw Marczak <p.marczak@samsung.com> wrote:
>>
>> Hello Stefan,
>>
>> On 01/28/2015 02:12 PM, Stefan Roese wrote:
>>> Hi Przemyslaw,
>>>
>>> On 28.01.2015 13:55, Przemyslaw Marczak wrote:
>>>> This patchset reduces the boot time for ARM architecture,
>>>> Exynos boards, and boards with DFU enabled(ARM).
>>>>
>>>> For tested Trats2 device, this was done in three steps.
>>>>
>>>> First was enable the arch memcpy and memset.
>>>> The second step was enable memset for .bss clear.
>>>> The third step for reduce this operation is to keep .bss section
>>>> small as possible.
>>>>
>>>> The .bss section will grow if we have a lot of static variables.
>>>> This section is cleared before jump to the relocated U-Boot,
>>>> and it's done word by word. To reduce the time for this step,
>>>> we can enable arch memset, which uses multiple ARM registers.
>>>>
>>>> For configs with DFU enabled, we can find the dfu buffer in this section,
>>>> which has at least 8MB (32MB for trats2). This is a lot of useless data,
>>>> which is not required for standard boot. So this buffer should be dynamic
>>>> allocated.
>>>>
>>>> Przemyslaw Marczak (3):
>>>>    exynos: config: enable arch memcpy and arch memset
>>>>    arm: relocation: clear .bss section with arch memset if defined
>>>>    dfu: mmc: file buffer: remove static allocation
>>>>
>>>>   arch/arm/lib/crt0.S             | 10 +++++++++-
>>>>   drivers/dfu/dfu_mmc.c           | 25 ++++++++++++++++++++++---
>>>>   include/configs/exynos-common.h |  3 +++
>>>>   3 files changed, 34 insertions(+), 4 deletions(-)
>>>
>>> Looking at the commit messages of this patchset I can conclude that your
>>> overall boot time reduction is:
>>>
>>> from ~1527ms
>>> to ~464ms
>>>
>>> This is amazing! Congrats. :)
>>>
>>
>> Thank you. I was also amazed.
>>
>> The time results are taken with from the clock cycle counter, I think it's reliable. Some day I would like to check it using the oscilloscope.
>>
>>> We really should in general make more use of the optimized functions and
>>> take care that the buffers (e.g. the DFU buffer in this case) are used
>>> in a sane way.
>>>
>>> Thanks,
>>> Stefan
>>>
>>>
>>
>> Yes you're right, I thought that Exynos config has enabled arch memcpy/set lib, before I checked this?
>>
>
> Those numbers are indeed incredible; I suppose the caches are disabled?
>
>

The caches are enabled after the relocation, in one of board_init_r calls.

>> Best regards,
>> --
>> Przemyslaw Marczak
>> Samsung R&D Institute Poland
>> Samsung Electronics
>> p.marczak at samsung.com
>
> Regards
>
> ? Pantelis
>
>

Best regards,
-- 
Przemyslaw Marczak
Samsung R&D Institute Poland
Samsung Electronics
p.marczak at samsung.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
  2015-01-28 14:30       ` Przemyslaw Marczak
@ 2015-01-28 14:34         ` Pantelis Antoniou
  2015-01-29 15:26           ` Przemyslaw Marczak
  0 siblings, 1 reply; 25+ messages in thread
From: Pantelis Antoniou @ 2015-01-28 14:34 UTC (permalink / raw)
  To: u-boot

Hi Przemyslaw,

> On Jan 28, 2015, at 16:30 , Przemyslaw Marczak <p.marczak@samsung.com> wrote:
> 
> Hello,
> 
> On 01/28/2015 03:18 PM, Pantelis Antoniou wrote:
>> Hi Przemyslaw,
>> 
>>> On Jan 28, 2015, at 16:10 , Przemyslaw Marczak <p.marczak@samsung.com> wrote:
>>> 
>>> Hello Stefan,
>>> 
>>> On 01/28/2015 02:12 PM, Stefan Roese wrote:
>>>> Hi Przemyslaw,
>>>> 
>>>> On 28.01.2015 13:55, Przemyslaw Marczak wrote:
>>>>> This patchset reduces the boot time for ARM architecture,
>>>>> Exynos boards, and boards with DFU enabled(ARM).
>>>>> 
>>>>> For tested Trats2 device, this was done in three steps.
>>>>> 
>>>>> First was enable the arch memcpy and memset.
>>>>> The second step was enable memset for .bss clear.
>>>>> The third step for reduce this operation is to keep .bss section
>>>>> small as possible.
>>>>> 
>>>>> The .bss section will grow if we have a lot of static variables.
>>>>> This section is cleared before jump to the relocated U-Boot,
>>>>> and it's done word by word. To reduce the time for this step,
>>>>> we can enable arch memset, which uses multiple ARM registers.
>>>>> 
>>>>> For configs with DFU enabled, we can find the dfu buffer in this section,
>>>>> which has at least 8MB (32MB for trats2). This is a lot of useless data,
>>>>> which is not required for standard boot. So this buffer should be dynamic
>>>>> allocated.
>>>>> 
>>>>> Przemyslaw Marczak (3):
>>>>>   exynos: config: enable arch memcpy and arch memset
>>>>>   arm: relocation: clear .bss section with arch memset if defined
>>>>>   dfu: mmc: file buffer: remove static allocation
>>>>> 
>>>>>  arch/arm/lib/crt0.S             | 10 +++++++++-
>>>>>  drivers/dfu/dfu_mmc.c           | 25 ++++++++++++++++++++++---
>>>>>  include/configs/exynos-common.h |  3 +++
>>>>>  3 files changed, 34 insertions(+), 4 deletions(-)
>>>> 
>>>> Looking at the commit messages of this patchset I can conclude that your
>>>> overall boot time reduction is:
>>>> 
>>>> from ~1527ms
>>>> to ~464ms
>>>> 
>>>> This is amazing! Congrats. :)
>>>> 
>>> 
>>> Thank you. I was also amazed.
>>> 
>>> The time results are taken with from the clock cycle counter, I think it's reliable. Some day I would like to check it using the oscilloscope.
>>> 
>>>> We really should in general make more use of the optimized functions and
>>>> take care that the buffers (e.g. the DFU buffer in this case) are used
>>>> in a sane way.
>>>> 
>>>> Thanks,
>>>> Stefan
>>>> 
>>>> 
>>> 
>>> Yes you're right, I thought that Exynos config has enabled arch memcpy/set lib, before I checked this?
>>> 
>> 
>> Those numbers are indeed incredible; I suppose the caches are disabled?
>> 
>> 
> 
> The caches are enabled after the relocation, in one of board_init_r calls.
> 

How big is this .bss section? We?re talking about something that takes 1.5secs to clear
a few MBs of memory? This is horrible.

Even at the optimized case .5secs is too much.

>>> Best regards,
>>> --
>>> Przemyslaw Marczak
>>> Samsung R&D Institute Poland
>>> Samsung Electronics
>>> p.marczak at samsung.com
>> 
>> Regards
>> 
>> ? Pantelis
>> 
>> 
> 
> Best regards,
> -- 
> Przemyslaw Marczak
> Samsung R&D Institute Poland
> Samsung Electronics
> p.marczak at samsung.com

Regards

? Pantelis

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
  2015-01-28 14:34         ` Pantelis Antoniou
@ 2015-01-29 15:26           ` Przemyslaw Marczak
  0 siblings, 0 replies; 25+ messages in thread
From: Przemyslaw Marczak @ 2015-01-29 15:26 UTC (permalink / raw)
  To: u-boot

Hello,

On 01/28/2015 03:34 PM, Pantelis Antoniou wrote:
> Hi Przemyslaw,
>
>> On Jan 28, 2015, at 16:30 , Przemyslaw Marczak <p.marczak@samsung.com> wrote:
>>
>> Hello,
>>
>> On 01/28/2015 03:18 PM, Pantelis Antoniou wrote:
>>> Hi Przemyslaw,
>>>
>>>> On Jan 28, 2015, at 16:10 , Przemyslaw Marczak <p.marczak@samsung.com> wrote:
>>>>
>>>> Hello Stefan,
>>>>
>>>> On 01/28/2015 02:12 PM, Stefan Roese wrote:
>>>>> Hi Przemyslaw,
>>>>>
>>>>> On 28.01.2015 13:55, Przemyslaw Marczak wrote:
>>>>>> This patchset reduces the boot time for ARM architecture,
>>>>>> Exynos boards, and boards with DFU enabled(ARM).
>>>>>>
>>>>>> For tested Trats2 device, this was done in three steps.
>>>>>>
>>>>>> First was enable the arch memcpy and memset.
>>>>>> The second step was enable memset for .bss clear.
>>>>>> The third step for reduce this operation is to keep .bss section
>>>>>> small as possible.
>>>>>>
>>>>>> The .bss section will grow if we have a lot of static variables.
>>>>>> This section is cleared before jump to the relocated U-Boot,
>>>>>> and it's done word by word. To reduce the time for this step,
>>>>>> we can enable arch memset, which uses multiple ARM registers.
>>>>>>
>>>>>> For configs with DFU enabled, we can find the dfu buffer in this section,
>>>>>> which has at least 8MB (32MB for trats2). This is a lot of useless data,
>>>>>> which is not required for standard boot. So this buffer should be dynamic
>>>>>> allocated.
>>>>>>
>>>>>> Przemyslaw Marczak (3):
>>>>>>    exynos: config: enable arch memcpy and arch memset
>>>>>>    arm: relocation: clear .bss section with arch memset if defined
>>>>>>    dfu: mmc: file buffer: remove static allocation
>>>>>>
>>>>>>   arch/arm/lib/crt0.S             | 10 +++++++++-
>>>>>>   drivers/dfu/dfu_mmc.c           | 25 ++++++++++++++++++++++---
>>>>>>   include/configs/exynos-common.h |  3 +++
>>>>>>   3 files changed, 34 insertions(+), 4 deletions(-)
>>>>>
>>>>> Looking at the commit messages of this patchset I can conclude that your
>>>>> overall boot time reduction is:
>>>>>
>>>>> from ~1527ms
>>>>> to ~464ms
>>>>>
>>>>> This is amazing! Congrats. :)
>>>>>
>>>>
>>>> Thank you. I was also amazed.
>>>>
>>>> The time results are taken with from the clock cycle counter, I think it's reliable. Some day I would like to check it using the oscilloscope.
>>>>
>>>>> We really should in general make more use of the optimized functions and
>>>>> take care that the buffers (e.g. the DFU buffer in this case) are used
>>>>> in a sane way.
>>>>>
>>>>> Thanks,
>>>>> Stefan
>>>>>
>>>>>
>>>>
>>>> Yes you're right, I thought that Exynos config has enabled arch memcpy/set lib, before I checked this?
>>>>
>>>
>>> Those numbers are indeed incredible; I suppose the caches are disabled?
>>>
>>>
>>
>> The caches are enabled after the relocation, in one of board_init_r calls.
>>
>
> How big is this .bss section? We?re talking about something that takes 1.5secs to clear
> a few MBs of memory? This is horrible.
>
> Even at the optimized case .5secs is too much.
>

The .bss section was about 32.3 MB, where the 32MB were required for dfu 
file buf. With the third patch, the .bss section is about 300k.
I'm not saying, that this is fast booting. But the achieved improvement 
is really good for this config

>>>> Best regards,
>>>> --
>>>> Przemyslaw Marczak
>>>> Samsung R&D Institute Poland
>>>> Samsung Electronics
>>>> p.marczak at samsung.com
>>>
>>> Regards
>>>
>>> ? Pantelis
>>>
>>>
>>
>> Best regards,
>> --
>> Przemyslaw Marczak
>> Samsung R&D Institute Poland
>> Samsung Electronics
>> p.marczak at samsung.com
>
> Regards
>
> ? Pantelis
>

Best regards,
-- 
Przemyslaw Marczak
Samsung R&D Institute Poland
Samsung Electronics
p.marczak at samsung.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
  2015-01-28 12:55 [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time Przemyslaw Marczak
                   ` (3 preceding siblings ...)
  2015-01-28 13:12 ` [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time Stefan Roese
@ 2015-01-29 16:48 ` Przemyslaw Marczak
  2015-02-02  8:46   ` Lukasz Majewski
  4 siblings, 1 reply; 25+ messages in thread
From: Przemyslaw Marczak @ 2015-01-29 16:48 UTC (permalink / raw)
  To: u-boot

Hello,

On 01/28/2015 01:55 PM, Przemyslaw Marczak wrote:
> This patchset reduces the boot time for ARM architecture,
> Exynos boards, and boards with DFU enabled(ARM).
>
> For tested Trats2 device, this was done in three steps.
>
> First was enable the arch memcpy and memset.
> The second step was enable memset for .bss clear.
> The third step for reduce this operation is to keep .bss section
> small as possible.
>
> The .bss section will grow if we have a lot of static variables.
> This section is cleared before jump to the relocated U-Boot,
> and it's done word by word. To reduce the time for this step,
> we can enable arch memset, which uses multiple ARM registers.
>
> For configs with DFU enabled, we can find the dfu buffer in this section,
> which has at least 8MB (32MB for trats2). This is a lot of useless data,
> which is not required for standard boot. So this buffer should be dynamic
> allocated.
>
> Przemyslaw Marczak (3):
>    exynos: config: enable arch memcpy and arch memset
>    arm: relocation: clear .bss section with arch memset if defined
>    dfu: mmc: file buffer: remove static allocation
>
>   arch/arm/lib/crt0.S             | 10 +++++++++-
>   drivers/dfu/dfu_mmc.c           | 25 ++++++++++++++++++++++---
>   include/configs/exynos-common.h |  3 +++
>   3 files changed, 34 insertions(+), 4 deletions(-)
>

So I made some additional tests with the oscilloscope.

Quick about the measurement:
The board is Odroid X2; Exynos4412(this one have gpio header).

Time is measured between change the state of one GPIo pin.

GPIO HI - set the gpio register in "reset" label in: 
arch/arm/cpu/armv7/start.S
GPIO LO - set gpio register with "bootcmd" with setting register by 
"mw.l ..."

${bootdelay}=0

odroid_defconfig  = .bss ~32.3MB

I tested few changes:
- 850ms - no changes:
- 840ms - + CONFIG_USE_ARCH_MEMCPY/MEMSET
- 540ms - .bss memset (patch 2)
- 210ms - dynamic allocation dfu file buf (patch 3)

And the next is interesting.
  odroid_defconfig has more than 80MB for malloc (we need about 64mb for 
the DFU now, to be able write 32MB file).

This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc is set 
to 0 in function mem_malloc_init(). So for this config that function
sets more than 80MB to zero.

This is not good, because we shouldn't expect zeroed memory returned by 
malloc pointer. This is a job for calloc.

Especially if some command expects zeroed memory after malloc, probably 
after few next calls - it can crash...

For the testing purposes I changed the memset area in mem_malloc_init().
The CONFIG_SYS_MALLOC_LEN is unchanged, so the dfu can still alloc 2x32MB...

The results:
- 158ms - malloc memset len: 40MB
- 109ms - malloc memset len:  1MB

And a quick test for Trats2 with trace clock cycle counter:
- 333ms - malloc memset len:  1MB (for the standard config it was more 
than 1520ms)

The malloc memset can't be removed now, because it requires check/change 
to calloc a lot of calls, but the board can boot if I set this to 256K.

So the final improvement which could be achieved for the odroid config 
is 850ms -> 109 ms. This is about 8 times faster.

And the tested boards difference:
- Trats2 - 800MHz
- Odroid X2 - 1000MHz
- different BL1/BL2

Now I'm not so sure about the measurement reliability using the trace.

The Trats2 has no gpios header, and now I don't have time for the 
combinations.

So enable the DFU in the board config will increase the boot time.
But the real reason is that the malloc memory area is set to zero on boot.

I think, that we should follow the malloc/calloc/realloc differences 
like in this description: http://man7.org/linux/man-pages/man3/malloc.3.html

Now I go for some holidays, and probably I will be unreachable until 
9-th February. Sorry for troubles.

Best regards,
-- 
Przemyslaw Marczak
Samsung R&D Institute Poland
Samsung Electronics
p.marczak at samsung.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 2/3] arm: relocation: clear .bss section with arch memset if defined
  2015-01-28 12:55 ` [U-Boot] [PATCH 2/3] arm: relocation: clear .bss section with arch memset if defined Przemyslaw Marczak
@ 2015-02-01  2:38   ` Albert ARIBAUD
  2015-02-02 17:04     ` Bill Pringlemeir
                       ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Albert ARIBAUD @ 2015-02-01  2:38 UTC (permalink / raw)
  To: u-boot

Hello Przemyslaw,

On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak
<p.marczak@samsung.com> wrote:
> For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY,
> will highly increase the memset/memcpy performance. This is able
> thanks to the ARM multiple register instructions.
> 
> Unfortunatelly the relocation is done without the cache enabled,
> so it takes some time, but zeroing the BSS memory takes much more
> longer, especially for the configs with big static buffers.
> 
> A quick test confirms, that the boot time improvement after using
> the arch memcpy for relocation has no significant meaning.
> The same test confirms that enable the memset for zeroing BSS,
> reduces the boot time.
> 
> So this patch enables the arch memset for zeroing the BSS after
> the relocation process. For ARM boards, this can be enabled
> in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.

Since the issue is that zeroing is done one word at a time, could we
not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do
a double (possibly quadruple) write loop? That would avoid calling a
libc routine from the almost sole file in U-Boot where a C environment
is not necessarily granted.

Amicalement,
-- 
Albert.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
  2015-01-29 16:48 ` Przemyslaw Marczak
@ 2015-02-02  8:46   ` Lukasz Majewski
  2015-02-02 18:15     ` Simon Glass
  0 siblings, 1 reply; 25+ messages in thread
From: Lukasz Majewski @ 2015-02-02  8:46 UTC (permalink / raw)
  To: u-boot

Dear All,

> And the next is interesting.
>   odroid_defconfig has more than 80MB for malloc (we need about 64mb
> for the DFU now, to be able write 32MB file).
> 
> This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc is
> set to 0 in function mem_malloc_init(). So for this config that
> function sets more than 80MB to zero.
> 
> This is not good, because we shouldn't expect zeroed memory returned
> by malloc pointer. This is a job for calloc.
> 
> Especially if some command expects zeroed memory after malloc,
> probably after few next calls - it can crash...

I think that the above excerpt is _really_ important and should be
discussed.

I've "cut" it from the original post, so it won't get lost between the
lines.

It seems really strange, that malloc() area is cleared after
relocation. Which means that all "first" malloc'ed buffers get
implicitly zeroed.

Przemek is right here that this zeroing shouldn't be performed.

I'm also concerned about potential bugs, which show up (or even worse -
won't show up soon) after this change.

Hence, I would like to ask directly the community about the possible
solutions.

Please look at: ./common/dlmalloc.c mem_alloc_init() function [1].

On the one hand removing memset() at [1] speeds up booting time and
makes malloc() doing what is is supposed to do.

On the other hand there might be in space some boards, which rely on
this memset and without it some wired things may start to happening.


-- 
Best regards,

Lukasz Majewski

Samsung R&D Institute Poland (SRPOL) | Linux Platform Group

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 2/3] arm: relocation: clear .bss section with arch memset if defined
  2015-02-01  2:38   ` Albert ARIBAUD
@ 2015-02-02 17:04     ` Bill Pringlemeir
  2015-02-02 17:25     ` Tom Rini
  2015-02-12 15:37     ` Tom Rini
  2 siblings, 0 replies; 25+ messages in thread
From: Bill Pringlemeir @ 2015-02-02 17:04 UTC (permalink / raw)
  To: u-boot

On  2 Feb 2015, bpringlemeir at nbsps.com wrote:

> On 31 Jan 2015, albert.u.boot at aribaud.net wrote:
>
>> Hello Przemyslaw,
>>
>> On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak
>> <p.marczak@samsung.com> wrote:
>>> For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY,
>>> will highly increase the memset/memcpy performance. This is able
>>> thanks to the ARM multiple register instructions.
>>>
>>> Unfortunatelly the relocation is done without the cache enabled,
>>> so it takes some time, but zeroing the BSS memory takes much more
>>> longer, especially for the configs with big static buffers.
>>>
>>> A quick test confirms, that the boot time improvement after using
>>> the arch memcpy for relocation has no significant meaning.
>>> The same test confirms that enable the memset for zeroing BSS,
>>> reduces the boot time.
>>>
>>> So this patch enables the arch memset for zeroing the BSS after
>>> the relocation process. For ARM boards, this can be enabled
>>> in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
>
>> Since the issue is that zeroing is done one word at a time, could we
>> not simply clear r3 as well as r2 (possibly even r4 and r5 too) and
>> do a double (possibly quadruple) write loop? That would avoid calling
>> a libc routine from the almost sole file in U-Boot where a C
>> environment is not necessarily granted.

I thought the same thing...

> diff --git a/arch/arm/lib/crt0.S b/arch/arm/lib/crt0.S index
> 22df3e5..fab3d2c 100644 --- a/arch/arm/lib/crt0.S +++
> b/arch/arm/lib/crt0.S @@ -115,14 +115,22 @@ here: bl
> c_runtime_cpu_setup /* we still call old routine here */
>
> 	ldr	r0, =__bss_start	/* this is auto-relocated! */
> -	ldr	r1, =__bss_end		/* this is auto-relocated! */
>
> +#ifdef CONFIG_USE_ARCH_MEMSET
> +	ldr	r3, =__bss_end		/* this is auto-relocated! */
> +	mov	r1, #0x00000000		/* prepare zero to clear BSS */
> +
> +	subs	r2, r3, r0		/* r2 = memset len */
> +	bl	memset
> +#else
> +	ldr	r1, =__bss_end		/* this is auto-relocated! */
> 	mov	r2, #0x00000000		/* prepare zero to clear BSS */
>
> clbss_l:cmp	r0, r1			/* while not at end of BSS */
> 	strlo	r2, [r0]		/* clear 32-bit BSS word */
> 	addlo	r0, r0, #4		/* move to next */
> 	blo	clbss_l
> +#endif

This is great news (increase boot speed).  Maybe if this files wasn't
conditional?  Assuming the the 'BSS' is aligned (LDS enforced),

         ldr     r0, =__bss_start        /* this is auto-relocated! */
         ldr     r1, =__bss_end          /* this is auto-relocated! */

+        mov     r2, #0         /* prepare zero to clear BSS */
+        mov     r3, #0
+        mov     r4, #0
+        mov     r5, #0
+        mov     r6, #0
+        mov     r7, #0
+        mov     r8, #0
+        mov     lr, #0
+        b       1f
+        .align  4
  clbss_l:
+         stmia   r0!, {r2-r8,lr}        /* clear 32 BSS word */
+         stmia   r0!, {r2-r8,lr}        /* easy to unroll */
+ 1:      cmp     r0, r1
          blo     clbss_l

The code should work on all ARM versions?  Then every platform would
benefit.  I think the larger portion of the 'ARCH memset' is to handle
alignment issues.  Sometimes the tail/head portion can be handled
efficiently with 'strd', etc which is only on some CPUs.  It is easy to
setup the BSS so that both the head/tail are aligned.... but I think the
above code only requires multiples of 32bytes as total BSS size (it is
easy to jump into the middle of an unrolled loop with 'add pc, rn<<2,
#constant').  The size/iteration can be easily tweaked (8/16/32 bytes).

At least in principal *if* there is some size alignment on BSS it is
fairly easy to write some generic ARM to quickly clear the BSS that will
be just as competitive as any ARCH memset version.  The above code is
adding about 13 words of code.

Fwiw,
Bill Pringlemeir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 2/3] arm: relocation: clear .bss section with arch memset if defined
  2015-02-01  2:38   ` Albert ARIBAUD
  2015-02-02 17:04     ` Bill Pringlemeir
@ 2015-02-02 17:25     ` Tom Rini
  2015-02-02 17:28       ` Pantelis Antoniou
  2015-02-12 15:37     ` Tom Rini
  2 siblings, 1 reply; 25+ messages in thread
From: Tom Rini @ 2015-02-02 17:25 UTC (permalink / raw)
  To: u-boot

On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote:
> Hello Przemyslaw,
> 
> On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak
> <p.marczak@samsung.com> wrote:
> > For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY,
> > will highly increase the memset/memcpy performance. This is able
> > thanks to the ARM multiple register instructions.
> > 
> > Unfortunatelly the relocation is done without the cache enabled,
> > so it takes some time, but zeroing the BSS memory takes much more
> > longer, especially for the configs with big static buffers.
> > 
> > A quick test confirms, that the boot time improvement after using
> > the arch memcpy for relocation has no significant meaning.
> > The same test confirms that enable the memset for zeroing BSS,
> > reduces the boot time.
> > 
> > So this patch enables the arch memset for zeroing the BSS after
> > the relocation process. For ARM boards, this can be enabled
> > in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
> 
> Since the issue is that zeroing is done one word at a time, could we
> not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do
> a double (possibly quadruple) write loop? That would avoid calling a
> libc routine from the almost sole file in U-Boot where a C environment
> is not necessarily granted.

So this brings up something I've wondered about for a long while.  We
have arch/arm/lib/mem{set,cpy}.S which are old copies from the linux
kernel.  The kernel uses them for all ARM platforms.  Why do we not
always use these functions?  I have a very vague notion it was a size
thing...

-- 
Tom
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 836 bytes
Desc: Digital signature
URL: <http://lists.denx.de/pipermail/u-boot/attachments/20150202/0649d2f6/attachment.sig>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 2/3] arm: relocation: clear .bss section with arch memset if defined
  2015-02-02 17:25     ` Tom Rini
@ 2015-02-02 17:28       ` Pantelis Antoniou
  2015-02-02 17:36         ` Tom Rini
  0 siblings, 1 reply; 25+ messages in thread
From: Pantelis Antoniou @ 2015-02-02 17:28 UTC (permalink / raw)
  To: u-boot

Hi Tom,

> On Feb 2, 2015, at 19:25 , Tom Rini <trini@ti.com> wrote:
> 
> On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote:
>> Hello Przemyslaw,
>> 
>> On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak
>> <p.marczak@samsung.com> wrote:
>>> For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY,
>>> will highly increase the memset/memcpy performance. This is able
>>> thanks to the ARM multiple register instructions.
>>> 
>>> Unfortunatelly the relocation is done without the cache enabled,
>>> so it takes some time, but zeroing the BSS memory takes much more
>>> longer, especially for the configs with big static buffers.
>>> 
>>> A quick test confirms, that the boot time improvement after using
>>> the arch memcpy for relocation has no significant meaning.
>>> The same test confirms that enable the memset for zeroing BSS,
>>> reduces the boot time.
>>> 
>>> So this patch enables the arch memset for zeroing the BSS after
>>> the relocation process. For ARM boards, this can be enabled
>>> in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
>> 
>> Since the issue is that zeroing is done one word at a time, could we
>> not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do
>> a double (possibly quadruple) write loop? That would avoid calling a
>> libc routine from the almost sole file in U-Boot where a C environment
>> is not necessarily granted.
> 
> So this brings up something I've wondered about for a long while.  We
> have arch/arm/lib/mem{set,cpy}.S which are old copies from the linux
> kernel.  The kernel uses them for all ARM platforms.  Why do we not
> always use these functions?  I have a very vague notion it was a size
> thing?
> 

That is a good question. Are we being hobbled cause of MLO? If so we can
use the short (and slow) methods in that case and use the fast methods
in the normal case. It seems that this is warranted in this case.

However in the particular case of dfu I think it?s best to avoid the large
static buffers. Or if we do use the large buffers let?s put them in a
linker segment that does not get zeroed on start.

> -- 
> Tom

Regards

? Pantelis

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 2/3] arm: relocation: clear .bss section with arch memset if defined
  2015-02-02 17:28       ` Pantelis Antoniou
@ 2015-02-02 17:36         ` Tom Rini
  0 siblings, 0 replies; 25+ messages in thread
From: Tom Rini @ 2015-02-02 17:36 UTC (permalink / raw)
  To: u-boot

On Mon, Feb 02, 2015 at 07:28:14PM +0200, Pantelis Antoniou wrote:
> Hi Tom,
> 
> > On Feb 2, 2015, at 19:25 , Tom Rini <trini@ti.com> wrote:
> > 
> > On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote:
> >> Hello Przemyslaw,
> >> 
> >> On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak
> >> <p.marczak@samsung.com> wrote:
> >>> For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY,
> >>> will highly increase the memset/memcpy performance. This is able
> >>> thanks to the ARM multiple register instructions.
> >>> 
> >>> Unfortunatelly the relocation is done without the cache enabled,
> >>> so it takes some time, but zeroing the BSS memory takes much more
> >>> longer, especially for the configs with big static buffers.
> >>> 
> >>> A quick test confirms, that the boot time improvement after using
> >>> the arch memcpy for relocation has no significant meaning.
> >>> The same test confirms that enable the memset for zeroing BSS,
> >>> reduces the boot time.
> >>> 
> >>> So this patch enables the arch memset for zeroing the BSS after
> >>> the relocation process. For ARM boards, this can be enabled
> >>> in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
> >> 
> >> Since the issue is that zeroing is done one word at a time, could we
> >> not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do
> >> a double (possibly quadruple) write loop? That would avoid calling a
> >> libc routine from the almost sole file in U-Boot where a C environment
> >> is not necessarily granted.
> > 
> > So this brings up something I've wondered about for a long while.  We
> > have arch/arm/lib/mem{set,cpy}.S which are old copies from the linux
> > kernel.  The kernel uses them for all ARM platforms.  Why do we not
> > always use these functions?  I have a very vague notion it was a size
> > thing?
> 
> That is a good question. Are we being hobbled cause of MLO? If so we can
> use the short (and slow) methods in that case and use the fast methods
> in the normal case. It seems that this is warranted in this case.

I'm not sure, but I can test easily enough.  But even then we may want
to opt a few targets in to the current (slow) path and make the default
the optimized path.

> However in the particular case of dfu I think it?s best to avoid the large
> static buffers. Or if we do use the large buffers let?s put them in a
> linker segment that does not get zeroed on start.

Yes, I owe the rest of the series my attention too :)

-- 
Tom
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 836 bytes
Desc: Digital signature
URL: <http://lists.denx.de/pipermail/u-boot/attachments/20150202/a2cc4b22/attachment.sig>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
  2015-02-02  8:46   ` Lukasz Majewski
@ 2015-02-02 18:15     ` Simon Glass
  2015-02-05  9:51       ` Lukasz Majewski
  2015-02-13 16:15       ` Przemyslaw Marczak
  0 siblings, 2 replies; 25+ messages in thread
From: Simon Glass @ 2015-02-02 18:15 UTC (permalink / raw)
  To: u-boot

Hi Lukasz,

On 2 February 2015 at 01:46, Lukasz Majewski <l.majewski@samsung.com> wrote:
> Dear All,
>
>> And the next is interesting.
>>   odroid_defconfig has more than 80MB for malloc (we need about 64mb
>> for the DFU now, to be able write 32MB file).
>>
>> This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc is
>> set to 0 in function mem_malloc_init(). So for this config that
>> function sets more than 80MB to zero.
>>
>> This is not good, because we shouldn't expect zeroed memory returned
>> by malloc pointer. This is a job for calloc.
>>
>> Especially if some command expects zeroed memory after malloc,
>> probably after few next calls - it can crash...
>
> I think that the above excerpt is _really_ important and should be
> discussed.
>
> I've "cut" it from the original post, so it won't get lost between the
> lines.
>
> It seems really strange, that malloc() area is cleared after
> relocation. Which means that all "first" malloc'ed buffers get
> implicitly zeroed.
>
> Przemek is right here that this zeroing shouldn't be performed.
>
> I'm also concerned about potential bugs, which show up (or even worse -
> won't show up soon) after this change.
>
> Hence, I would like to ask directly the community about the possible
> solutions.
>
> Please look at: ./common/dlmalloc.c mem_alloc_init() function [1].
>
> On the one hand removing memset() at [1] speeds up booting time and
> makes malloc() doing what is is supposed to do.
>
> On the other hand there might be in space some boards, which rely on
> this memset and without it some wired things may start to happening.

I think removing it is a good idea. It was one optimisation that I did
for boot time in the Chromium tree. If you do it now (and Tom agrees)
then there is plenty of time to test for this release cycle. You could
go further and add a test CONFIG which fills it with some other
non-zero value.

Regards,
Simon

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
  2015-02-02 18:15     ` Simon Glass
@ 2015-02-05  9:51       ` Lukasz Majewski
  2015-02-12 16:07         ` Tom Rini
  2015-02-13 16:15       ` Przemyslaw Marczak
  1 sibling, 1 reply; 25+ messages in thread
From: Lukasz Majewski @ 2015-02-05  9:51 UTC (permalink / raw)
  To: u-boot

Hi Simon,

> Hi Lukasz,
> 
> On 2 February 2015 at 01:46, Lukasz Majewski <l.majewski@samsung.com>
> wrote:
> > Dear All,
> >
> >> And the next is interesting.
> >>   odroid_defconfig has more than 80MB for malloc (we need about
> >> 64mb for the DFU now, to be able write 32MB file).
> >>
> >> This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc
> >> is set to 0 in function mem_malloc_init(). So for this config that
> >> function sets more than 80MB to zero.
> >>
> >> This is not good, because we shouldn't expect zeroed memory
> >> returned by malloc pointer. This is a job for calloc.
> >>
> >> Especially if some command expects zeroed memory after malloc,
> >> probably after few next calls - it can crash...
> >
> > I think that the above excerpt is _really_ important and should be
> > discussed.
> >
> > I've "cut" it from the original post, so it won't get lost between
> > the lines.
> >
> > It seems really strange, that malloc() area is cleared after
> > relocation. Which means that all "first" malloc'ed buffers get
> > implicitly zeroed.
> >
> > Przemek is right here that this zeroing shouldn't be performed.
> >
> > I'm also concerned about potential bugs, which show up (or even
> > worse - won't show up soon) after this change.
> >
> > Hence, I would like to ask directly the community about the possible
> > solutions.
> >
> > Please look at: ./common/dlmalloc.c mem_alloc_init() function [1].
> >
> > On the one hand removing memset() at [1] speeds up booting time and
> > makes malloc() doing what is is supposed to do.
> >
> > On the other hand there might be in space some boards, which rely on
> > this memset and without it some wired things may start to happening.
> 
> I think removing it is a good idea. It was one optimisation that I did
> for boot time in the Chromium tree. If you do it now (and Tom agrees)
> then there is plenty of time to test for this release cycle. You could
> go further and add a test CONFIG which fills it with some other
> non-zero value.

Tom, is such approach acceptable for you?

> 
> Regards,
> Simon



-- 
Best regards,

Lukasz Majewski

Samsung R&D Institute Poland (SRPOL) | Linux Platform Group

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 2/3] arm: relocation: clear .bss section with arch memset if defined
  2015-02-01  2:38   ` Albert ARIBAUD
  2015-02-02 17:04     ` Bill Pringlemeir
  2015-02-02 17:25     ` Tom Rini
@ 2015-02-12 15:37     ` Tom Rini
  2015-02-13 16:23       ` Przemyslaw Marczak
  2 siblings, 1 reply; 25+ messages in thread
From: Tom Rini @ 2015-02-12 15:37 UTC (permalink / raw)
  To: u-boot

On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote:
> Hello Przemyslaw,
> 
> On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak
> <p.marczak@samsung.com> wrote:
> > For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY,
> > will highly increase the memset/memcpy performance. This is able
> > thanks to the ARM multiple register instructions.
> > 
> > Unfortunatelly the relocation is done without the cache enabled,
> > so it takes some time, but zeroing the BSS memory takes much more
> > longer, especially for the configs with big static buffers.
> > 
> > A quick test confirms, that the boot time improvement after using
> > the arch memcpy for relocation has no significant meaning.
> > The same test confirms that enable the memset for zeroing BSS,
> > reduces the boot time.
> > 
> > So this patch enables the arch memset for zeroing the BSS after
> > the relocation process. For ARM boards, this can be enabled
> > in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
> 
> Since the issue is that zeroing is done one word at a time, could we
> not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do
> a double (possibly quadruple) write loop? That would avoid calling a
> libc routine from the almost sole file in U-Boot where a C environment
> is not necessarily granted.

I want to jump up here again.  Note that the arch memset/memcpy routines
are in asm and I don't belive require a C environment.  Why don't we
simply use the asm versions for everyone and backport whatever we need
from the kernel to re-sync there as it's not a choice there and it's a
performance win too?

-- 
Tom
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 836 bytes
Desc: Digital signature
URL: <http://lists.denx.de/pipermail/u-boot/attachments/20150212/5f3913ba/attachment.sig>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
  2015-02-05  9:51       ` Lukasz Majewski
@ 2015-02-12 16:07         ` Tom Rini
  2015-02-13 15:48           ` Przemyslaw Marczak
  0 siblings, 1 reply; 25+ messages in thread
From: Tom Rini @ 2015-02-12 16:07 UTC (permalink / raw)
  To: u-boot

On Thu, Feb 05, 2015 at 10:51:00AM +0100, Lukasz Majewski wrote:
> Hi Simon,
> 
> > Hi Lukasz,
> > 
> > On 2 February 2015 at 01:46, Lukasz Majewski <l.majewski@samsung.com>
> > wrote:
> > > Dear All,
> > >
> > >> And the next is interesting.
> > >>   odroid_defconfig has more than 80MB for malloc (we need about
> > >> 64mb for the DFU now, to be able write 32MB file).
> > >>
> > >> This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc
> > >> is set to 0 in function mem_malloc_init(). So for this config that
> > >> function sets more than 80MB to zero.
> > >>
> > >> This is not good, because we shouldn't expect zeroed memory
> > >> returned by malloc pointer. This is a job for calloc.
> > >>
> > >> Especially if some command expects zeroed memory after malloc,
> > >> probably after few next calls - it can crash...
> > >
> > > I think that the above excerpt is _really_ important and should be
> > > discussed.
> > >
> > > I've "cut" it from the original post, so it won't get lost between
> > > the lines.
> > >
> > > It seems really strange, that malloc() area is cleared after
> > > relocation. Which means that all "first" malloc'ed buffers get
> > > implicitly zeroed.
> > >
> > > Przemek is right here that this zeroing shouldn't be performed.
> > >
> > > I'm also concerned about potential bugs, which show up (or even
> > > worse - won't show up soon) after this change.
> > >
> > > Hence, I would like to ask directly the community about the possible
> > > solutions.
> > >
> > > Please look at: ./common/dlmalloc.c mem_alloc_init() function [1].
> > >
> > > On the one hand removing memset() at [1] speeds up booting time and
> > > makes malloc() doing what is is supposed to do.
> > >
> > > On the other hand there might be in space some boards, which rely on
> > > this memset and without it some wired things may start to happening.
> > 
> > I think removing it is a good idea. It was one optimisation that I did
> > for boot time in the Chromium tree. If you do it now (and Tom agrees)
> > then there is plenty of time to test for this release cycle. You could
> > go further and add a test CONFIG which fills it with some other
> > non-zero value.
> 
> Tom, is such approach acceptable for you?

I was thinking at first we should default to a poisoned value.  But
given what we're seeing with generic board updates (lots of boards
aren't even build-tested at every release which isn't really a
surprise), I think the "funky" boards which may exist are probably not
going to be seen for a while anyhow so we'd have to default to a poison
for a long while.  So yes, lets just add a CONFIG option (and Kconfig
line) to optionally do it and default to no memset.

... but I just audited everyone doing "malloc (" and found a few things
to fixup so we really do want to take a poke around.

-- 
Tom
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 836 bytes
Desc: Digital signature
URL: <http://lists.denx.de/pipermail/u-boot/attachments/20150212/18211374/attachment.sig>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
  2015-02-12 16:07         ` Tom Rini
@ 2015-02-13 15:48           ` Przemyslaw Marczak
  2015-02-13 18:13             ` Tom Rini
  0 siblings, 1 reply; 25+ messages in thread
From: Przemyslaw Marczak @ 2015-02-13 15:48 UTC (permalink / raw)
  To: u-boot

Hello,

On 02/12/2015 05:07 PM, Tom Rini wrote:
> On Thu, Feb 05, 2015 at 10:51:00AM +0100, Lukasz Majewski wrote:
>> Hi Simon,
>>
>>> Hi Lukasz,
>>>
>>> On 2 February 2015 at 01:46, Lukasz Majewski <l.majewski@samsung.com>
>>> wrote:
>>>> Dear All,
>>>>
>>>>> And the next is interesting.
>>>>>    odroid_defconfig has more than 80MB for malloc (we need about
>>>>> 64mb for the DFU now, to be able write 32MB file).
>>>>>
>>>>> This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc
>>>>> is set to 0 in function mem_malloc_init(). So for this config that
>>>>> function sets more than 80MB to zero.
>>>>>
>>>>> This is not good, because we shouldn't expect zeroed memory
>>>>> returned by malloc pointer. This is a job for calloc.
>>>>>
>>>>> Especially if some command expects zeroed memory after malloc,
>>>>> probably after few next calls - it can crash...
>>>>
>>>> I think that the above excerpt is _really_ important and should be
>>>> discussed.
>>>>
>>>> I've "cut" it from the original post, so it won't get lost between
>>>> the lines.
>>>>
>>>> It seems really strange, that malloc() area is cleared after
>>>> relocation. Which means that all "first" malloc'ed buffers get
>>>> implicitly zeroed.
>>>>
>>>> Przemek is right here that this zeroing shouldn't be performed.
>>>>
>>>> I'm also concerned about potential bugs, which show up (or even
>>>> worse - won't show up soon) after this change.
>>>>
>>>> Hence, I would like to ask directly the community about the possible
>>>> solutions.
>>>>
>>>> Please look at: ./common/dlmalloc.c mem_alloc_init() function [1].
>>>>
>>>> On the one hand removing memset() at [1] speeds up booting time and
>>>> makes malloc() doing what is is supposed to do.
>>>>
>>>> On the other hand there might be in space some boards, which rely on
>>>> this memset and without it some wired things may start to happening.
>>>
>>> I think removing it is a good idea. It was one optimisation that I did
>>> for boot time in the Chromium tree. If you do it now (and Tom agrees)
>>> then there is plenty of time to test for this release cycle. You could
>>> go further and add a test CONFIG which fills it with some other
>>> non-zero value.
>>
>> Tom, is such approach acceptable for you?
>
> I was thinking at first we should default to a poisoned value.  But
> given what we're seeing with generic board updates (lots of boards
> aren't even build-tested at every release which isn't really a
> surprise), I think the "funky" boards which may exist are probably not
> going to be seen for a while anyhow so we'd have to default to a poison
> for a long while.  So yes, lets just add a CONFIG option (and Kconfig
> line) to optionally do it and default to no memset.
>
> ... but I just audited everyone doing "malloc (" and found a few things
> to fixup so we really do want to take a poke around.
>

The present malloc implementation, which includes the memset at init, 
could be little confusing. Because of this memset. the "few" first calls 
of malloc will always return a pointer to the zeroed memory region. Of 
course, the only calloc do the right job and set with zeros the memory 
region, which was used by any previous malloc call.

So, maybe some code expects zeroed memory also when using malloc.

In this case, I would like to skip this memset as an option.

This also requires skip some checking in calloc code.

I will send the patch with such option for Kconfig.

Best regards,
-- 
Przemyslaw Marczak
Samsung R&D Institute Poland
Samsung Electronics
p.marczak at samsung.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
  2015-02-02 18:15     ` Simon Glass
  2015-02-05  9:51       ` Lukasz Majewski
@ 2015-02-13 16:15       ` Przemyslaw Marczak
  1 sibling, 0 replies; 25+ messages in thread
From: Przemyslaw Marczak @ 2015-02-13 16:15 UTC (permalink / raw)
  To: u-boot

Hello Simon,

On 02/02/2015 07:15 PM, Simon Glass wrote:
> Hi Lukasz,
>
> On 2 February 2015 at 01:46, Lukasz Majewski <l.majewski@samsung.com> wrote:
>> Dear All,
>>
>>> And the next is interesting.
>>>    odroid_defconfig has more than 80MB for malloc (we need about 64mb
>>> for the DFU now, to be able write 32MB file).
>>>
>>> This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc is
>>> set to 0 in function mem_malloc_init(). So for this config that
>>> function sets more than 80MB to zero.
>>>
>>> This is not good, because we shouldn't expect zeroed memory returned
>>> by malloc pointer. This is a job for calloc.
>>>
>>> Especially if some command expects zeroed memory after malloc,
>>> probably after few next calls - it can crash...
>>
>> I think that the above excerpt is _really_ important and should be
>> discussed.
>>
>> I've "cut" it from the original post, so it won't get lost between the
>> lines.
>>
>> It seems really strange, that malloc() area is cleared after
>> relocation. Which means that all "first" malloc'ed buffers get
>> implicitly zeroed.
>>
>> Przemek is right here that this zeroing shouldn't be performed.
>>
>> I'm also concerned about potential bugs, which show up (or even worse -
>> won't show up soon) after this change.
>>
>> Hence, I would like to ask directly the community about the possible
>> solutions.
>>
>> Please look at: ./common/dlmalloc.c mem_alloc_init() function [1].
>>
>> On the one hand removing memset() at [1] speeds up booting time and
>> makes malloc() doing what is is supposed to do.
>>
>> On the other hand there might be in space some boards, which rely on
>> this memset and without it some wired things may start to happening.
>
> I think removing it is a good idea. It was one optimisation that I did
> for boot time in the Chromium tree. If you do it now (and Tom agrees)
> then there is plenty of time to test for this release cycle. You could
> go further and add a test CONFIG which fills it with some other
> non-zero value.
>
> Regards,
> Simon
>

Filling the malloc area with some pattern was a good idea to find out, 
why my trats2 had some issue after skip the memset with zeros in malloc 
init.

And actually the issue was not in malloc call, but it was in calloc.

The present implementation assumes that memory reserved for malloc
is zeroed at init. And the calloc do the check, how much of the 
allocated memory is fresh(doesn't require zeroing).

After skip this fresh memory check, the calloc works fine.

Anyway, I think that this should be optional and tested on every config, 
before enable.

I would like to test something and will send the updated patch set on 
Monday.

Best regards,
-- 
Przemyslaw Marczak
Samsung R&D Institute Poland
Samsung Electronics
p.marczak at samsung.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 2/3] arm: relocation: clear .bss section with arch memset if defined
  2015-02-12 15:37     ` Tom Rini
@ 2015-02-13 16:23       ` Przemyslaw Marczak
  0 siblings, 0 replies; 25+ messages in thread
From: Przemyslaw Marczak @ 2015-02-13 16:23 UTC (permalink / raw)
  To: u-boot

Hello Tom,

On 02/12/2015 04:37 PM, Tom Rini wrote:
> On Sun, Feb 01, 2015 at 03:38:42AM +0100, Albert ARIBAUD wrote:
>> Hello Przemyslaw,
>>
>> On Wed, 28 Jan 2015 13:55:42 +0100, Przemyslaw Marczak
>> <p.marczak@samsung.com> wrote:
>>> For ARM architecture, enable the CONFIG_USE_ARCH_MEMSET/MEMCPY,
>>> will highly increase the memset/memcpy performance. This is able
>>> thanks to the ARM multiple register instructions.
>>>
>>> Unfortunatelly the relocation is done without the cache enabled,
>>> so it takes some time, but zeroing the BSS memory takes much more
>>> longer, especially for the configs with big static buffers.
>>>
>>> A quick test confirms, that the boot time improvement after using
>>> the arch memcpy for relocation has no significant meaning.
>>> The same test confirms that enable the memset for zeroing BSS,
>>> reduces the boot time.
>>>
>>> So this patch enables the arch memset for zeroing the BSS after
>>> the relocation process. For ARM boards, this can be enabled
>>> in board configs by defining: 'CONFIG_USE_ARCH_MEMSET'.
>>
>> Since the issue is that zeroing is done one word at a time, could we
>> not simply clear r3 as well as r2 (possibly even r4 and r5 too) and do
>> a double (possibly quadruple) write loop? That would avoid calling a
>> libc routine from the almost sole file in U-Boot where a C environment
>> is not necessarily granted.
>
> I want to jump up here again.  Note that the arch memset/memcpy routines
> are in asm and I don't belive require a C environment.  Why don't we
> simply use the asm versions for everyone and backport whatever we need
> from the kernel to re-sync there as it's not a choice there and it's a
> performance win too?
>

Right, for ARM the mentioned routines doesn't require C env. But if we 
could achieve some improvement in this place, then maybe it has sense to 
add some new code just for bss.

I will try to combine and make some timing tests on Monday.

Best regards,
-- 
Przemyslaw Marczak
Samsung R&D Institute Poland
Samsung Electronics
p.marczak at samsung.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time
  2015-02-13 15:48           ` Przemyslaw Marczak
@ 2015-02-13 18:13             ` Tom Rini
  0 siblings, 0 replies; 25+ messages in thread
From: Tom Rini @ 2015-02-13 18:13 UTC (permalink / raw)
  To: u-boot

On Fri, Feb 13, 2015 at 04:48:15PM +0100, Przemyslaw Marczak wrote:
> Hello,
> 
> On 02/12/2015 05:07 PM, Tom Rini wrote:
> >On Thu, Feb 05, 2015 at 10:51:00AM +0100, Lukasz Majewski wrote:
> >>Hi Simon,
> >>
> >>>Hi Lukasz,
> >>>
> >>>On 2 February 2015 at 01:46, Lukasz Majewski <l.majewski@samsung.com>
> >>>wrote:
> >>>>Dear All,
> >>>>
> >>>>>And the next is interesting.
> >>>>>   odroid_defconfig has more than 80MB for malloc (we need about
> >>>>>64mb for the DFU now, to be able write 32MB file).
> >>>>>
> >>>>>This is the CONFIG_SYS_MALLOC_LEN. And the memory area for malloc
> >>>>>is set to 0 in function mem_malloc_init(). So for this config that
> >>>>>function sets more than 80MB to zero.
> >>>>>
> >>>>>This is not good, because we shouldn't expect zeroed memory
> >>>>>returned by malloc pointer. This is a job for calloc.
> >>>>>
> >>>>>Especially if some command expects zeroed memory after malloc,
> >>>>>probably after few next calls - it can crash...
> >>>>
> >>>>I think that the above excerpt is _really_ important and should be
> >>>>discussed.
> >>>>
> >>>>I've "cut" it from the original post, so it won't get lost between
> >>>>the lines.
> >>>>
> >>>>It seems really strange, that malloc() area is cleared after
> >>>>relocation. Which means that all "first" malloc'ed buffers get
> >>>>implicitly zeroed.
> >>>>
> >>>>Przemek is right here that this zeroing shouldn't be performed.
> >>>>
> >>>>I'm also concerned about potential bugs, which show up (or even
> >>>>worse - won't show up soon) after this change.
> >>>>
> >>>>Hence, I would like to ask directly the community about the possible
> >>>>solutions.
> >>>>
> >>>>Please look at: ./common/dlmalloc.c mem_alloc_init() function [1].
> >>>>
> >>>>On the one hand removing memset() at [1] speeds up booting time and
> >>>>makes malloc() doing what is is supposed to do.
> >>>>
> >>>>On the other hand there might be in space some boards, which rely on
> >>>>this memset and without it some wired things may start to happening.
> >>>
> >>>I think removing it is a good idea. It was one optimisation that I did
> >>>for boot time in the Chromium tree. If you do it now (and Tom agrees)
> >>>then there is plenty of time to test for this release cycle. You could
> >>>go further and add a test CONFIG which fills it with some other
> >>>non-zero value.
> >>
> >>Tom, is such approach acceptable for you?
> >
> >I was thinking at first we should default to a poisoned value.  But
> >given what we're seeing with generic board updates (lots of boards
> >aren't even build-tested at every release which isn't really a
> >surprise), I think the "funky" boards which may exist are probably not
> >going to be seen for a while anyhow so we'd have to default to a poison
> >for a long while.  So yes, lets just add a CONFIG option (and Kconfig
> >line) to optionally do it and default to no memset.
> >
> >... but I just audited everyone doing "malloc (" and found a few things
> >to fixup so we really do want to take a poke around.
> >
> 
> The present malloc implementation, which includes the memset at
> init, could be little confusing. Because of this memset. the "few"
> first calls of malloc will always return a pointer to the zeroed
> memory region. Of course, the only calloc do the right job and set
> with zeros the memory region, which was used by any previous malloc
> call.
> 
> So, maybe some code expects zeroed memory also when using malloc.
> 
> In this case, I would like to skip this memset as an option.
> 
> This also requires skip some checking in calloc code.
> 
> I will send the patch with such option for Kconfig.

There might be a few things which are depending on this odd behviour but
they shouldn't.  Most things are zeroing out twice since it's normal to
malloc+memset if you care about contents being zero.  The only real
hang-up I see is that we're half way to release, more or less.  Maybe we
do this early post v2015.04 release to give more time to catch the odd
cases?  And spend some time now auditing.

-- 
Tom
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 836 bytes
Desc: Digital signature
URL: <http://lists.denx.de/pipermail/u-boot/attachments/20150213/2eea5fcb/attachment.sig>

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2015-02-13 18:13 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-28 12:55 [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time Przemyslaw Marczak
2015-01-28 12:55 ` [U-Boot] [PATCH 1/3] exynos: config: enable arch memcpy and arch memset Przemyslaw Marczak
2015-01-28 12:55 ` [U-Boot] [PATCH 2/3] arm: relocation: clear .bss section with arch memset if defined Przemyslaw Marczak
2015-02-01  2:38   ` Albert ARIBAUD
2015-02-02 17:04     ` Bill Pringlemeir
2015-02-02 17:25     ` Tom Rini
2015-02-02 17:28       ` Pantelis Antoniou
2015-02-02 17:36         ` Tom Rini
2015-02-12 15:37     ` Tom Rini
2015-02-13 16:23       ` Przemyslaw Marczak
2015-01-28 12:55 ` [U-Boot] [PATCH 3/3] dfu: mmc: file buffer: remove static allocation Przemyslaw Marczak
2015-01-28 13:12 ` [U-Boot] [PATCH 0/3] arm: reduce .bss section clear time Stefan Roese
2015-01-28 14:10   ` Przemyslaw Marczak
2015-01-28 14:18     ` Pantelis Antoniou
2015-01-28 14:30       ` Przemyslaw Marczak
2015-01-28 14:34         ` Pantelis Antoniou
2015-01-29 15:26           ` Przemyslaw Marczak
2015-01-29 16:48 ` Przemyslaw Marczak
2015-02-02  8:46   ` Lukasz Majewski
2015-02-02 18:15     ` Simon Glass
2015-02-05  9:51       ` Lukasz Majewski
2015-02-12 16:07         ` Tom Rini
2015-02-13 15:48           ` Przemyslaw Marczak
2015-02-13 18:13             ` Tom Rini
2015-02-13 16:15       ` Przemyslaw Marczak

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.