All of lore.kernel.org
 help / color / mirror / Atom feed
* [U-Boot] [PATCH 0/3] RFC: Patches to reduce TPL code size
@ 2017-03-26 23:38 Simon Glass
  2017-03-26 23:38 ` [U-Boot] [PATCH 1/3] Makefile: Correct dependency race condition with TPL Simon Glass
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Simon Glass @ 2017-03-26 23:38 UTC (permalink / raw)
  To: u-boot

With the rockchip 'rock' board some build and code size problems have come
to light with TPL. This series provides a few ideas to improve things.


Simon Glass (3):
  Makefile: Correct dependency race condition with TPL
  string: Provide a slimmed-down memset()
  Makefile: Provide an option to select SPL or TPL

 Makefile               | 3 ++-
 lib/Kconfig            | 9 +++++++++
 lib/string.c           | 6 ++++--
 scripts/Kbuild.include | 6 ++++++
 scripts/Makefile.spl   | 6 ++++++
 5 files changed, 27 insertions(+), 3 deletions(-)

-- 
2.12.1.578.ge9c3154ca4-goog

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [U-Boot] [PATCH 1/3] Makefile: Correct dependency race condition with TPL
  2017-03-26 23:38 [U-Boot] [PATCH 0/3] RFC: Patches to reduce TPL code size Simon Glass
@ 2017-03-26 23:38 ` Simon Glass
  2017-03-27 19:39   ` Heiko Stuebner
  2017-03-26 23:38 ` [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset() Simon Glass
  2017-03-26 23:38 ` [U-Boot] [PATCH 3/3] Makefile: Provide an option to select SPL or TPL Simon Glass
  2 siblings, 1 reply; 12+ messages in thread
From: Simon Glass @ 2017-03-26 23:38 UTC (permalink / raw)
  To: u-boot

At present we sometimes see the following build error when building on a
machine with multiple cores.

+make[2]: *** No rule to make target 'dts/dt.dtb', needed by 'tpl/u-boot-tpl.dtb'.  Stop.

Add a dependency to correct this.

Signed-off-by: Simon Glass <sjg@chromium.org>
---

 Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 1001bc5a7b..5e82b78ade 100644
--- a/Makefile
+++ b/Makefile
@@ -1351,7 +1351,8 @@ spl/u-boot-spl.sfp: spl/u-boot-spl
 spl/boot.bin: spl/u-boot-spl
 	@:
 
-tpl/u-boot-tpl.bin: tools prepare
+tpl/u-boot-tpl.bin: tools prepare \
+		$(if $(CONFIG_OF_SEPARATE)$(CONFIG_SPL_OF_PLATDATA),dts/dt.dtb)
 	$(Q)$(MAKE) obj=tpl -f $(srctree)/scripts/Makefile.spl all
 
 TAG_SUBDIRS := $(patsubst %,$(srctree)/%,$(u-boot-dirs) include)
-- 
2.12.1.578.ge9c3154ca4-goog

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset()
  2017-03-26 23:38 [U-Boot] [PATCH 0/3] RFC: Patches to reduce TPL code size Simon Glass
  2017-03-26 23:38 ` [U-Boot] [PATCH 1/3] Makefile: Correct dependency race condition with TPL Simon Glass
@ 2017-03-26 23:38 ` Simon Glass
  2017-03-27  7:14   ` Alexander Graf
                     ` (2 more replies)
  2017-03-26 23:38 ` [U-Boot] [PATCH 3/3] Makefile: Provide an option to select SPL or TPL Simon Glass
  2 siblings, 3 replies; 12+ messages in thread
From: Simon Glass @ 2017-03-26 23:38 UTC (permalink / raw)
  To: u-boot

Most of the time the optimised memset() is what we want. For extreme
situations such as TPL it may be too large. For example on the 'rock'
board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and
the rodata bug, this patch is enough to reduce the TPL image below the
limit.

Signed-off-by: Simon Glass <sjg@chromium.org>
---

 lib/Kconfig  | 9 +++++++++
 lib/string.c | 6 ++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/lib/Kconfig b/lib/Kconfig
index 65c01573e1..5bf512d8c0 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -52,6 +52,15 @@ config LIB_RAND
 	help
 	  This library provides pseudo-random number generator functions.
 
+config FAST_MEMSET
+	bool "Use an optimised memset()"
+	default y
+	help
+	  The faster memset() is the arch-specific one (if available) enabled
+	  by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get
+	  better performance by write a word at a time. Disable this option
+	  to reduce code size slightly at the cost of some speed.
+
 source lib/dhry/Kconfig
 
 source lib/rsa/Kconfig
diff --git a/lib/string.c b/lib/string.c
index 67d5f6a421..159493ed17 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -437,8 +437,10 @@ char *strswab(const char *s)
 void * memset(void * s,int c,size_t count)
 {
 	unsigned long *sl = (unsigned long *) s;
-	unsigned long cl = 0;
 	char *s8;
+
+#ifdef CONFIG_FAST_MEMSET
+	unsigned long cl = 0;
 	int i;
 
 	/* do it one word at a time (32 bits or 64 bits) while possible */
@@ -452,7 +454,7 @@ void * memset(void * s,int c,size_t count)
 			count -= sizeof(*sl);
 		}
 	}
-	/* fill 8 bits at a time */
+#endif	/* fill 8 bits at a time */
 	s8 = (char *)sl;
 	while (count--)
 		*s8++ = c;
-- 
2.12.1.578.ge9c3154ca4-goog

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [U-Boot] [PATCH 3/3] Makefile: Provide an option to select SPL or TPL
  2017-03-26 23:38 [U-Boot] [PATCH 0/3] RFC: Patches to reduce TPL code size Simon Glass
  2017-03-26 23:38 ` [U-Boot] [PATCH 1/3] Makefile: Correct dependency race condition with TPL Simon Glass
  2017-03-26 23:38 ` [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset() Simon Glass
@ 2017-03-26 23:38 ` Simon Glass
  2 siblings, 0 replies; 12+ messages in thread
From: Simon Glass @ 2017-03-26 23:38 UTC (permalink / raw)
  To: u-boot

At present we have SPL_ which can be used in Makefiles to select between
normal and SPL CONFIGs like this:

    obj-$(CONFIG_$(SPL_)DM)		+= core/

When TPL is being built, SPL_ has the value 'SPL' which is generally a
good idea since they tend to follow each other. But in extreme situations
we may want to distinugish between SPL and TPL. For example we may not
want to enable CONFIG_DM with TPL.

Add a new SPL_TPL_ variable which is set to either empty (for U-Boot
proper), 'SPL' or 'TPL'. This may prove useful with TPL-specific options.

Signed-off-by: Simon Glass <sjg@chromium.org>
---

 scripts/Kbuild.include | 6 ++++++
 scripts/Makefile.spl   | 6 ++++++
 2 files changed, 12 insertions(+)

diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include
index 1b62aedb00..a3a5c59d0d 100644
--- a/scripts/Kbuild.include
+++ b/scripts/Kbuild.include
@@ -321,6 +321,12 @@ endif
 
 ifdef CONFIG_SPL_BUILD
 SPL_ := SPL_
+ifeq ($(CONFIG_TPL_BUILD),y)
+SPL_TPL_ := TPL_
+else
+SPL_TPL_ := SPL_
+endif
 else
 SPL_ :=
+SPL_TPL_ :=
 endif
diff --git a/scripts/Makefile.spl b/scripts/Makefile.spl
index 5370648e85..4485ea8812 100644
--- a/scripts/Makefile.spl
+++ b/scripts/Makefile.spl
@@ -37,8 +37,14 @@ endif
 
 ifdef CONFIG_SPL_BUILD
 SPL_ := SPL_
+ifeq ($(CONFIG_TPL_BUILD),y)
+SPL_TPL_ := TPL_
+else
+SPL_TPL_ := SPL_
+endif
 else
 SPL_ :=
+SPL_TPL_ :=
 endif
 
 include $(srctree)/config.mk
-- 
2.12.1.578.ge9c3154ca4-goog

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset()
  2017-03-26 23:38 ` [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset() Simon Glass
@ 2017-03-27  7:14   ` Alexander Graf
  2017-03-27 15:17     ` Heiko Stuebner
  2017-03-27 19:55   ` Heiko Stuebner
  2017-03-30 11:14   ` [U-Boot] [PATCH v2] " Heiko Stuebner
  2 siblings, 1 reply; 12+ messages in thread
From: Alexander Graf @ 2017-03-27  7:14 UTC (permalink / raw)
  To: u-boot



On 27/03/2017 01:38, Simon Glass wrote:
> Most of the time the optimised memset() is what we want. For extreme
> situations such as TPL it may be too large. For example on the 'rock'
> board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and
> the rodata bug, this patch is enough to reduce the TPL image below the
> limit.
>
> Signed-off-by: Simon Glass <sjg@chromium.org>
> ---
>
>  lib/Kconfig  | 9 +++++++++
>  lib/string.c | 6 ++++--
>  2 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/lib/Kconfig b/lib/Kconfig
> index 65c01573e1..5bf512d8c0 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -52,6 +52,15 @@ config LIB_RAND
>  	help
>  	  This library provides pseudo-random number generator functions.
>
> +config FAST_MEMSET
> +	bool "Use an optimised memset()"
> +	default y
> +	help
> +	  The faster memset() is the arch-specific one (if available) enabled
> +	  by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get
> +	  better performance by write a word at a time. Disable this option
> +	  to reduce code size slightly at the cost of some speed.

The comment sounds slightly confused - it took me a few times of reading 
it until I grasped what it was trying to tell me :).

> +
>  source lib/dhry/Kconfig
>
>  source lib/rsa/Kconfig
> diff --git a/lib/string.c b/lib/string.c
> index 67d5f6a421..159493ed17 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -437,8 +437,10 @@ char *strswab(const char *s)
>  void * memset(void * s,int c,size_t count)
>  {
>  	unsigned long *sl = (unsigned long *) s;
> -	unsigned long cl = 0;
>  	char *s8;
> +
> +#ifdef CONFIG_FAST_MEMSET
> +	unsigned long cl = 0;
>  	int i;
>
>  	/* do it one word at a time (32 bits or 64 bits) while possible */
> @@ -452,7 +454,7 @@ void * memset(void * s,int c,size_t count)
>  			count -= sizeof(*sl);
>  		}
>  	}
> -	/* fill 8 bits at a time */
> +#endif	/* fill 8 bits at a time */

So while this is all neat, a few ideas:

1) Would having memset in a header improve things even more? After all, 
each external function call clobbers registers that you need to 
save/restore...

2) How much would GOLD save you? Have you tried? U-Boot is small enough 
of a code base that global optimizations should be able to give 
significant size savings.


Alex

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset()
  2017-03-27  7:14   ` Alexander Graf
@ 2017-03-27 15:17     ` Heiko Stuebner
  2017-03-27 21:16       ` Alexander Graf
  0 siblings, 1 reply; 12+ messages in thread
From: Heiko Stuebner @ 2017-03-27 15:17 UTC (permalink / raw)
  To: u-boot

Am Montag, 27. März 2017, 09:14:47 CEST schrieb Alexander Graf:
> 
> On 27/03/2017 01:38, Simon Glass wrote:
> > Most of the time the optimised memset() is what we want. For extreme
> > situations such as TPL it may be too large. For example on the 'rock'
> > board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and
> > the rodata bug, this patch is enough to reduce the TPL image below the
> > limit.
> >
> > Signed-off-by: Simon Glass <sjg@chromium.org>
> > ---
> >
> >  lib/Kconfig  | 9 +++++++++
> >  lib/string.c | 6 ++++--
> >  2 files changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/Kconfig b/lib/Kconfig
> > index 65c01573e1..5bf512d8c0 100644
> > --- a/lib/Kconfig
> > +++ b/lib/Kconfig
> > @@ -52,6 +52,15 @@ config LIB_RAND
> >  	help
> >  	  This library provides pseudo-random number generator functions.
> >
> > +config FAST_MEMSET
> > +	bool "Use an optimised memset()"
> > +	default y
> > +	help
> > +	  The faster memset() is the arch-specific one (if available) enabled
> > +	  by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get
> > +	  better performance by write a word at a time. Disable this option
> > +	  to reduce code size slightly at the cost of some speed.
> 
> The comment sounds slightly confused - it took me a few times of reading 
> it until I grasped what it was trying to tell me :).
> 
> > +
> >  source lib/dhry/Kconfig
> >
> >  source lib/rsa/Kconfig
> > diff --git a/lib/string.c b/lib/string.c
> > index 67d5f6a421..159493ed17 100644
> > --- a/lib/string.c
> > +++ b/lib/string.c
> > @@ -437,8 +437,10 @@ char *strswab(const char *s)
> >  void * memset(void * s,int c,size_t count)
> >  {
> >  	unsigned long *sl = (unsigned long *) s;
> > -	unsigned long cl = 0;
> >  	char *s8;
> > +
> > +#ifdef CONFIG_FAST_MEMSET
> > +	unsigned long cl = 0;
> >  	int i;
> >
> >  	/* do it one word at a time (32 bits or 64 bits) while possible */
> > @@ -452,7 +454,7 @@ void * memset(void * s,int c,size_t count)
> >  			count -= sizeof(*sl);
> >  		}
> >  	}
> > -	/* fill 8 bits at a time */
> > +#endif	/* fill 8 bits at a time */
> 
> So while this is all neat, a few ideas:
> 
> 1) Would having memset in a header improve things even more? After all, 
> each external function call clobbers registers that you need to 
> save/restore...

I'd guess it really depends on the size constraints. The regular
libgeneric memset compiles on my rk3188 tpl to a total of
64bytes on both gcc-4.9 and gcc-6.3 while Simon's fast-memset
comes down to 14bytes on my rk3188.

On the rk3188 the only memset user is board_init_f, so here memset
is called only once without needing to save registers and I'd guess if an
implementation really is that size-constrained to worry about 50bytes
this one caller will probably always be the only one?


> 2) How much would GOLD save you? Have you tried? U-Boot is small enough 
> of a code base that global optimizations should be able to give 
> significant size savings.

I think the issue that this is trying to solve is to allow more
toolchains to be used and thus make rebuilds on changes work on a lot
of boards at the same time with random toolchains.

gcc-6.3 already produces way smaller results (well within the size
constraints the rk3188 has) than for example the gcc-4.9 used by
buildman as baseline toolchain.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [U-Boot] [PATCH 1/3] Makefile: Correct dependency race condition with TPL
  2017-03-26 23:38 ` [U-Boot] [PATCH 1/3] Makefile: Correct dependency race condition with TPL Simon Glass
@ 2017-03-27 19:39   ` Heiko Stuebner
  2017-04-02  0:05     ` Simon Glass
  0 siblings, 1 reply; 12+ messages in thread
From: Heiko Stuebner @ 2017-03-27 19:39 UTC (permalink / raw)
  To: u-boot

Am Sonntag, 26. März 2017, 17:38:15 CEST schrieb Simon Glass:
> At present we sometimes see the following build error when building on a
> machine with multiple cores.
> 
> +make[2]: *** No rule to make target 'dts/dt.dtb', needed by 'tpl/u-boot-tpl.dtb'.  Stop.
> 
> Add a dependency to correct this.
> 
> Signed-off-by: Simon Glass <sjg@chromium.org>

Fixes the build of rk3188-rock board using buildman for me with more than
one job, which failed very reliable before, so

Tested-by: Heiko Stuebner <heiko@sntech.de>

> ---
> 
>  Makefile | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/Makefile b/Makefile
> index 1001bc5a7b..5e82b78ade 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1351,7 +1351,8 @@ spl/u-boot-spl.sfp: spl/u-boot-spl
>  spl/boot.bin: spl/u-boot-spl
>  	@:
>  
> -tpl/u-boot-tpl.bin: tools prepare
> +tpl/u-boot-tpl.bin: tools prepare \
> +		$(if $(CONFIG_OF_SEPARATE)$(CONFIG_SPL_OF_PLATDATA),dts/dt.dtb)
>  	$(Q)$(MAKE) obj=tpl -f $(srctree)/scripts/Makefile.spl all
>  
>  TAG_SUBDIRS := $(patsubst %,$(srctree)/%,$(u-boot-dirs) include)
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset()
  2017-03-26 23:38 ` [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset() Simon Glass
  2017-03-27  7:14   ` Alexander Graf
@ 2017-03-27 19:55   ` Heiko Stuebner
  2017-03-30 11:14   ` [U-Boot] [PATCH v2] " Heiko Stuebner
  2 siblings, 0 replies; 12+ messages in thread
From: Heiko Stuebner @ 2017-03-27 19:55 UTC (permalink / raw)
  To: u-boot

Am Sonntag, 26. März 2017, 17:38:16 CEST schrieb Simon Glass:
> Most of the time the optimised memset() is what we want. For extreme
> situations such as TPL it may be too large. For example on the 'rock'
> board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and
> the rodata bug, this patch is enough to reduce the TPL image below the
> limit.
> 
> Signed-off-by: Simon Glass <sjg@chromium.org>

This brings down the rk3188-rock tpl from 1020 to 972 bytes (with a
1020 byte size limit for the tpl) even with gcc-4.9 and down to 748 bytes
on gcc-6.3.

I was using the original memset in all tests before, so am quite sure
it should work without issues, but cannot test it on actual hardware
this week.


Heiko


> ---
> 
>  lib/Kconfig  | 9 +++++++++
>  lib/string.c | 6 ++++--
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/Kconfig b/lib/Kconfig
> index 65c01573e1..5bf512d8c0 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -52,6 +52,15 @@ config LIB_RAND
>  	help
>  	  This library provides pseudo-random number generator functions.
>  
> +config FAST_MEMSET
> +	bool "Use an optimised memset()"
> +	default y
> +	help
> +	  The faster memset() is the arch-specific one (if available) enabled
> +	  by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get
> +	  better performance by write a word at a time. Disable this option
> +	  to reduce code size slightly at the cost of some speed.
> +
>  source lib/dhry/Kconfig
>  
>  source lib/rsa/Kconfig
> diff --git a/lib/string.c b/lib/string.c
> index 67d5f6a421..159493ed17 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -437,8 +437,10 @@ char *strswab(const char *s)
>  void * memset(void * s,int c,size_t count)
>  {
>  	unsigned long *sl = (unsigned long *) s;
> -	unsigned long cl = 0;
>  	char *s8;
> +
> +#ifdef CONFIG_FAST_MEMSET
> +	unsigned long cl = 0;
>  	int i;
>  
>  	/* do it one word at a time (32 bits or 64 bits) while possible */
> @@ -452,7 +454,7 @@ void * memset(void * s,int c,size_t count)
>  			count -= sizeof(*sl);
>  		}
>  	}
> -	/* fill 8 bits at a time */
> +#endif	/* fill 8 bits at a time */
>  	s8 = (char *)sl;
>  	while (count--)
>  		*s8++ = c;
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset()
  2017-03-27 15:17     ` Heiko Stuebner
@ 2017-03-27 21:16       ` Alexander Graf
  2017-03-28 12:34         ` Heiko Stuebner
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Graf @ 2017-03-27 21:16 UTC (permalink / raw)
  To: u-boot



On 27/03/2017 17:17, Heiko Stuebner wrote:
> Am Montag, 27. März 2017, 09:14:47 CEST schrieb Alexander Graf:
>>
>> On 27/03/2017 01:38, Simon Glass wrote:
>>> Most of the time the optimised memset() is what we want. For extreme
>>> situations such as TPL it may be too large. For example on the 'rock'
>>> board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and
>>> the rodata bug, this patch is enough to reduce the TPL image below the
>>> limit.
>>>
>>> Signed-off-by: Simon Glass <sjg@chromium.org>
>>> ---
>>>
>>>  lib/Kconfig  | 9 +++++++++
>>>  lib/string.c | 6 ++++--
>>>  2 files changed, 13 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/lib/Kconfig b/lib/Kconfig
>>> index 65c01573e1..5bf512d8c0 100644
>>> --- a/lib/Kconfig
>>> +++ b/lib/Kconfig
>>> @@ -52,6 +52,15 @@ config LIB_RAND
>>>  	help
>>>  	  This library provides pseudo-random number generator functions.
>>>
>>> +config FAST_MEMSET
>>> +	bool "Use an optimised memset()"
>>> +	default y
>>> +	help
>>> +	  The faster memset() is the arch-specific one (if available) enabled
>>> +	  by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get
>>> +	  better performance by write a word at a time. Disable this option
>>> +	  to reduce code size slightly at the cost of some speed.
>>
>> The comment sounds slightly confused - it took me a few times of reading
>> it until I grasped what it was trying to tell me :).
>>
>>> +
>>>  source lib/dhry/Kconfig
>>>
>>>  source lib/rsa/Kconfig
>>> diff --git a/lib/string.c b/lib/string.c
>>> index 67d5f6a421..159493ed17 100644
>>> --- a/lib/string.c
>>> +++ b/lib/string.c
>>> @@ -437,8 +437,10 @@ char *strswab(const char *s)
>>>  void * memset(void * s,int c,size_t count)
>>>  {
>>>  	unsigned long *sl = (unsigned long *) s;
>>> -	unsigned long cl = 0;
>>>  	char *s8;
>>> +
>>> +#ifdef CONFIG_FAST_MEMSET
>>> +	unsigned long cl = 0;
>>>  	int i;
>>>
>>>  	/* do it one word at a time (32 bits or 64 bits) while possible */
>>> @@ -452,7 +454,7 @@ void * memset(void * s,int c,size_t count)
>>>  			count -= sizeof(*sl);
>>>  		}
>>>  	}
>>> -	/* fill 8 bits at a time */
>>> +#endif	/* fill 8 bits at a time */
>>
>> So while this is all neat, a few ideas:
>>
>> 1) Would having memset in a header improve things even more? After all,
>> each external function call clobbers registers that you need to
>> save/restore...
>
> I'd guess it really depends on the size constraints. The regular
> libgeneric memset compiles on my rk3188 tpl to a total of
> 64bytes on both gcc-4.9 and gcc-6.3 while Simon's fast-memset
> comes down to 14bytes on my rk3188.
>
> On the rk3188 the only memset user is board_init_f, so here memset
> is called only once without needing to save registers and I'd guess if an
> implementation really is that size-constrained to worry about 50bytes
> this one caller will probably always be the only one?

I'm not sure I follow. If you put it into a header, the compiler has a 
better chance of evicting untaken code paths and optimize register usage 
over object linked variants (unless you use GOLD). I was mostly 
wondering whether that would already give you the savings without 
introducing a complicated #ifdef that is going to bitrot over time :).

I'm just slightly worried about the massive number of preprocessor 
excludes that happen in U-Boot in general. It seems like something 
that's really hard to ever have full testing coverage on.

>> 2) How much would GOLD save you? Have you tried? U-Boot is small enough
>> of a code base that global optimizations should be able to give
>> significant size savings.
>
> I think the issue that this is trying to solve is to allow more
> toolchains to be used and thus make rebuilds on changes work on a lot
> of boards at the same time with random toolchains.
>
> gcc-6.3 already produces way smaller results (well within the size
> constraints the rk3188 has) than for example the gcc-4.9 used by
> buildman as baseline toolchain.

Ah, I see. So 4.9 does not have -lto? There's a good chance my gut 
feeling that GOLD actually saves anything is wrong - I don't know. Has 
anyone done the numbers? Then we would have something to actually base 
gut feeling on.

Size is always a serious constraint in U-Boot, especially in SPL 
environments. If we can include one more tool in our portfolio to 
optimize size across the board, I'm all for it. This patch just feels 
slightly short-term - but I'm definitely not nack'ing it :).


Alex

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset()
  2017-03-27 21:16       ` Alexander Graf
@ 2017-03-28 12:34         ` Heiko Stuebner
  0 siblings, 0 replies; 12+ messages in thread
From: Heiko Stuebner @ 2017-03-28 12:34 UTC (permalink / raw)
  To: u-boot

Am Montag, 27. März 2017, 23:16:45 CEST schrieb Alexander Graf:
> 
> On 27/03/2017 17:17, Heiko Stuebner wrote:
> > Am Montag, 27. März 2017, 09:14:47 CEST schrieb Alexander Graf:
> >>
> >> On 27/03/2017 01:38, Simon Glass wrote:
> >>> Most of the time the optimised memset() is what we want. For extreme
> >>> situations such as TPL it may be too large. For example on the 'rock'
> >>> board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and
> >>> the rodata bug, this patch is enough to reduce the TPL image below the
> >>> limit.
> >>>
> >>> Signed-off-by: Simon Glass <sjg@chromium.org>
> >>> ---
> >>>
> >>>  lib/Kconfig  | 9 +++++++++
> >>>  lib/string.c | 6 ++++--
> >>>  2 files changed, 13 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/lib/Kconfig b/lib/Kconfig
> >>> index 65c01573e1..5bf512d8c0 100644
> >>> --- a/lib/Kconfig
> >>> +++ b/lib/Kconfig
> >>> @@ -52,6 +52,15 @@ config LIB_RAND
> >>>  	help
> >>>  	  This library provides pseudo-random number generator functions.
> >>>
> >>> +config FAST_MEMSET
> >>> +	bool "Use an optimised memset()"
> >>> +	default y
> >>> +	help
> >>> +	  The faster memset() is the arch-specific one (if available) enabled
> >>> +	  by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get
> >>> +	  better performance by write a word at a time. Disable this option
> >>> +	  to reduce code size slightly at the cost of some speed.
> >>
> >> The comment sounds slightly confused - it took me a few times of reading
> >> it until I grasped what it was trying to tell me :).
> >>
> >>> +
> >>>  source lib/dhry/Kconfig
> >>>
> >>>  source lib/rsa/Kconfig
> >>> diff --git a/lib/string.c b/lib/string.c
> >>> index 67d5f6a421..159493ed17 100644
> >>> --- a/lib/string.c
> >>> +++ b/lib/string.c
> >>> @@ -437,8 +437,10 @@ char *strswab(const char *s)
> >>>  void * memset(void * s,int c,size_t count)
> >>>  {
> >>>  	unsigned long *sl = (unsigned long *) s;
> >>> -	unsigned long cl = 0;
> >>>  	char *s8;
> >>> +
> >>> +#ifdef CONFIG_FAST_MEMSET
> >>> +	unsigned long cl = 0;
> >>>  	int i;
> >>>
> >>>  	/* do it one word at a time (32 bits or 64 bits) while possible */
> >>> @@ -452,7 +454,7 @@ void * memset(void * s,int c,size_t count)
> >>>  			count -= sizeof(*sl);
> >>>  		}
> >>>  	}
> >>> -	/* fill 8 bits at a time */
> >>> +#endif	/* fill 8 bits at a time */
> >>
> >> So while this is all neat, a few ideas:
> >>
> >> 1) Would having memset in a header improve things even more? After all,
> >> each external function call clobbers registers that you need to
> >> save/restore...
> >
> > I'd guess it really depends on the size constraints. The regular
> > libgeneric memset compiles on my rk3188 tpl to a total of
> > 64bytes on both gcc-4.9 and gcc-6.3 while Simon's fast-memset
> > comes down to 14bytes on my rk3188.
> >
> > On the rk3188 the only memset user is board_init_f, so here memset
> > is called only once without needing to save registers and I'd guess if an
> > implementation really is that size-constrained to worry about 50bytes
> > this one caller will probably always be the only one?
> 
> I'm not sure I follow. If you put it into a header, the compiler has a 
> better chance of evicting untaken code paths and optimize register usage 
> over object linked variants (unless you use GOLD). I was mostly 
> wondering whether that would already give you the savings without 
> introducing a complicated #ifdef that is going to bitrot over time :).

On rk3188-tpl that small non-fast memset gets compiled to (bfd linker):
100809aa <board_init_f_init_reserve>:
100809aa:       b510            push    {r4, lr}
100809ac:       22c0            movs    r2, #192        ; 0xc0
100809ae:       2100            movs    r1, #0
100809b0:       4604            mov     r4, r0
100809b2:       f000 f804       bl      100809be <memset>
100809b6:       34c0            adds    r4, #192        ; 0xc0
100809b8:       f8c9 4090       str.w   r4, [r9, #144]  ; 0x90
100809bc:       bd10            pop     {r4, pc}

100809be <memset>:
100809be:       4402            add     r2, r0
100809c0:       4603            mov     r3, r0
100809c2:       4293            cmp     r3, r2
100809c4:       d100            bne.n   100809c8 <memset+0xa>
100809c6:       4770            bx      lr
100809c8:       f803 1b01       strb.w  r1, [r3], #1
100809cc:       e7f9            b.n     100809c2 <memset+0x4>

not saving any outside registers, as it's used only once at all and what
I was trying to say was that in cases where we worry about having the
tiniest memset possible, I guess that will most likely stay the only call.

But I may have been dug into the rk3188 tpl-specifics to long, to see
other possible cases right now :-) .


> I'm just slightly worried about the massive number of preprocessor 
> excludes that happen in U-Boot in general. It seems like something 
> that's really hard to ever have full testing coverage on.

That's essentially what I was worried about as well, seeing that memset
can be provided by different sources it seems.
There is the libgeneric memset we're having here and also the arch-
specific memset (way faster but also again way bigger) and without using
either, one could also provide some completely separate implementation
at the moment.

So having one version in a header would probably also incur some sort of
ifdef voodoo?


> >> 2) How much would GOLD save you? Have you tried? U-Boot is small enough
> >> of a code base that global optimizations should be able to give
> >> significant size savings.
> >
> > I think the issue that this is trying to solve is to allow more
> > toolchains to be used and thus make rebuilds on changes work on a lot
> > of boards at the same time with random toolchains.
> >
> > gcc-6.3 already produces way smaller results (well within the size
> > constraints the rk3188 has) than for example the gcc-4.9 used by
> > buildman as baseline toolchain.
> 
> Ah, I see. So 4.9 does not have -lto? There's a good chance my gut 
> feeling that GOLD actually saves anything is wrong - I don't know. Has 
> anyone done the numbers? Then we would have something to actually base 
> gut feeling on.

It looks like the u-boot Makefile makes explicitly sure to use GNU ld.
So I didn't try to dig deeper into this :-) .


> Size is always a serious constraint in U-Boot, especially in SPL 
> environments. If we can include one more tool in our portfolio to 
> optimize size across the board, I'm all for it. This patch just feels 
> slightly short-term - but I'm definitely not nack'ing it :).


Heiko

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [U-Boot] [PATCH v2] string: Provide a slimmed-down memset()
  2017-03-26 23:38 ` [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset() Simon Glass
  2017-03-27  7:14   ` Alexander Graf
  2017-03-27 19:55   ` Heiko Stuebner
@ 2017-03-30 11:14   ` Heiko Stuebner
  2 siblings, 0 replies; 12+ messages in thread
From: Heiko Stuebner @ 2017-03-30 11:14 UTC (permalink / raw)
  To: u-boot

Most of the time the optimised memset() is what we want. For extreme
situations such as TPL it may be too large. For example on the 'rock'
board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and
the rodata bug, this patch is enough to reduce the TPL image below the
limit.

Signed-off-by: Simon Glass <sjg@chromium.org>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
---
Hi Simon,

a bit bikesheddy, but might it make more sense to structure the
options like below? That way it matches USE_ARCH_MEMSET and might
make the intent visible better, as you get
USE_ARCH_MEMSET=y = biggest but also fastest
(nothing) = default from libgeneric
USE_TINY_MEMSET=y = optimize for size over speed

Also might make reading defconfigs easier as you would have
    CONFIG_USE_TINY_MEMSET=y
instead of
    # CONFIG_FAST_MEMSET is not set
when needing that option.

Anyway, I've tested both variants on a live rk3188-rock now and
everything of course still works, even when build with gcc-4.9, so
both variants also
Tested-by: Heiko Stuebner <heiko@sntech.de>


Heiko


 lib/Kconfig  | 20 ++++++++++++++++++++
 lib/string.c |  5 ++++-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/lib/Kconfig b/lib/Kconfig
index 65c01573e1..ab42413839 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -52,6 +52,26 @@ config LIB_RAND
 	help
 	  This library provides pseudo-random number generator functions.
 
+config USE_TINY_MEMSET
+	bool "Use a size-optimized memset()"
+	help
+	  This makes memset prefer code size over speed optimizations.
+	  The fastest memset() is the arch-specific one (if available) enabled
+	  by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get
+	  better performance by writing a word at a time at the cost of
+	  slightly bigger memset code, but in some special cases size might
+	  be more important than speed.
+
+config SPL_USE_TINY_MEMSET
+	bool "Use a size-optimized memset()"
+	help
+	  This makes memset prefer code size over speed optimizations.
+	  The fastest memset() is the arch-specific one (if available) enabled
+	  by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get
+	  better performance by writing a word at a time at the cost of
+	  slightly bigger memset code, but in some special cases size might
+	  be more important than speed.
+
 source lib/dhry/Kconfig
 
 source lib/rsa/Kconfig
diff --git a/lib/string.c b/lib/string.c
index 67d5f6a421..edae997fa6 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -437,8 +437,10 @@ char *strswab(const char *s)
 void * memset(void * s,int c,size_t count)
 {
 	unsigned long *sl = (unsigned long *) s;
-	unsigned long cl = 0;
 	char *s8;
+
+#if !CONFIG_IS_ENABLED(USE_TINY_MEMSET)
+	unsigned long cl = 0;
 	int i;
 
 	/* do it one word at a time (32 bits or 64 bits) while possible */
@@ -452,6 +454,7 @@ void * memset(void * s,int c,size_t count)
 			count -= sizeof(*sl);
 		}
 	}
+#endif
 	/* fill 8 bits at a time */
 	s8 = (char *)sl;
 	while (count--)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [U-Boot] [PATCH 1/3] Makefile: Correct dependency race condition with TPL
  2017-03-27 19:39   ` Heiko Stuebner
@ 2017-04-02  0:05     ` Simon Glass
  0 siblings, 0 replies; 12+ messages in thread
From: Simon Glass @ 2017-04-02  0:05 UTC (permalink / raw)
  To: u-boot

On 27 March 2017 at 13:39, Heiko Stuebner <heiko@sntech.de> wrote:
> Am Sonntag, 26. März 2017, 17:38:15 CEST schrieb Simon Glass:
>> At present we sometimes see the following build error when building on a
>> machine with multiple cores.
>>
>> +make[2]: *** No rule to make target 'dts/dt.dtb', needed by 'tpl/u-boot-tpl.dtb'.  Stop.
>>
>> Add a dependency to correct this.
>>
>> Signed-off-by: Simon Glass <sjg@chromium.org>
>
> Fixes the build of rk3188-rock board using buildman for me with more than
> one job, which failed very reliable before, so
>
> Tested-by: Heiko Stuebner <heiko@sntech.de>
>
>> ---
>>
>>  Makefile | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/Makefile b/Makefile
>> index 1001bc5a7b..5e82b78ade 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -1351,7 +1351,8 @@ spl/u-boot-spl.sfp: spl/u-boot-spl
>>  spl/boot.bin: spl/u-boot-spl
>>       @:
>>
>> -tpl/u-boot-tpl.bin: tools prepare
>> +tpl/u-boot-tpl.bin: tools prepare \
>> +             $(if $(CONFIG_OF_SEPARATE)$(CONFIG_SPL_OF_PLATDATA),dts/dt.dtb)
>>       $(Q)$(MAKE) obj=tpl -f $(srctree)/scripts/Makefile.spl all
>>
>>  TAG_SUBDIRS := $(patsubst %,$(srctree)/%,$(u-boot-dirs) include)
>>
>
>

Applied to u-boot-rockchip.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-04-02  0:05 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-26 23:38 [U-Boot] [PATCH 0/3] RFC: Patches to reduce TPL code size Simon Glass
2017-03-26 23:38 ` [U-Boot] [PATCH 1/3] Makefile: Correct dependency race condition with TPL Simon Glass
2017-03-27 19:39   ` Heiko Stuebner
2017-04-02  0:05     ` Simon Glass
2017-03-26 23:38 ` [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset() Simon Glass
2017-03-27  7:14   ` Alexander Graf
2017-03-27 15:17     ` Heiko Stuebner
2017-03-27 21:16       ` Alexander Graf
2017-03-28 12:34         ` Heiko Stuebner
2017-03-27 19:55   ` Heiko Stuebner
2017-03-30 11:14   ` [U-Boot] [PATCH v2] " Heiko Stuebner
2017-03-26 23:38 ` [U-Boot] [PATCH 3/3] Makefile: Provide an option to select SPL or TPL Simon Glass

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.