All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Optimise memset on i386
@ 2010-06-23 21:38 Colin Watson
  2010-06-25 18:04 ` Vladimir 'φ-coder/phcoder' Serbinenko
  2010-06-25 18:27 ` Vladimir 'φ-coder/phcoder' Serbinenko
  0 siblings, 2 replies; 10+ messages in thread
From: Colin Watson @ 2010-06-23 21:38 UTC (permalink / raw)
  To: grub-devel

I've been working on a project involving fast boot speeds, which are
timed from post-BIOS (this is an axiom of the project and not something
I get to change).  As such I've been working on optimising GRUB, and
have been digging into it with this general debugging pattern (probably
not optimal, I'm just showing it as an example of my approach in case
anyone else is trying to do similar things):

#include <grub/time.h>
#include <grub/env.h>

void
grub_function (void)
{
  char *value, *valueend;

  value = grub_malloc (16384);
  *value = '\0';
  valueend = value;

  grub_snprintf (valueend, 100, "%lld", grub_get_time_ms ());
  valueend = grub_strchr (valueend, '\0');

  /* do slow stuff */

  grub_snprintf (valueend, 100, "%lld", grub_get_time_ms ());
  valueend = grub_strchr (valueend, '\0');

  /* do more slow stuff */

  grub_snprintf (valueend, 100, "%lld", grub_get_time_ms ());
  valueend = grub_strchr (valueend, '\0');
  
  grub_env_set ("identifying_name", value);
  grub_free (value);
}

This seems lightweight enough that it doesn't interfere much with
timings, and the approach of stuffing things into an environment
variable is useful because you can get millisecond-precision timings
even for video initialisation.

With this approach, one of the most noticeable time sinks is that
setting a graphical video mode (I'm using the VBE backend) takes ages:
1.6 seconds, which is a substantial percentage of this project's total
boot time.  It turns out that most of this is spent initialising
double-buffering: doublebuf_pageflipping_init calls
grub_video_fb_create_render_target_from_pointer twice, and each call
takes a little over 600 milliseconds.  Now,
grub_video_fb_create_render_target_from_pointer is basically just a big
grub_memset to clear framebuffer memory, so this equates to under two
frames per second.  What's going on?

It turns out that write caching is disabled on video memory when GRUB is
running, so we take a cache stall on every single write, and it's
apparently hard to enable caching without implementing MTRRs.  People
who know more about this than I do tell me that this can get
unpleasantly CPU-specific at times, although I still hold out some hope
that it's possible in GRUB.

However, there's a way to substantially speed things up without that.
The naïve implementation of grub_memset writes a byte at a time, and for
that matter on i386 it compiles to a poorly-optimised loop rather than
using REP STOS or similar.  grub_memset is an inner loop practically by
definition, and it's worth optimising.  We can fix both of these
weaknesses by importing the optimised memset from GNU libc: since it
writes four bytes at a time except (sometimes) at the start and end, it
should take about a quarter the number of cache stalls.  And, indeed,
measurement bears this out: instead of taking over 600 milliseconds per
call to grub_video_fb_create_render_target_from_pointer (I think it was
actually 630 or so, though I neglected to write that down), GRUB now
takes about 160 milliseconds per call.  Much better!

The optimised memset is LGPLv2.1 or later, and I've preserved that
notice, but as far as I know this should be fine for use in GRUB; it can
be upgraded to LGPLv3, and that's just GPLv3 with some additional
permissions.  It's already assigned to the FSF due to being in glibc.

2010-06-23  Colin Watson  <cjwatson@ubuntu.com>

        * conf/any-emu.rmk (kernel_img_SOURCES): Add kern/string.c.
        * conf/common.rmk (grub_mkdevicemap_SOURCES): Likewise.
	(grub_probe_SOURCES): Likewise.
	(grub_fstest_SOURCES): Likewise.
	(grub_script_check_SOURCES): Likewise.
	(grub_editenv_SOURCES): Likewise.
	* conf/mips-qemu-mips.rmk (kernel_img_SOURCES): Likewise.
	* conf/mips-yeeloong.rmk (kernel_img_SOURCES): Likewise.
	* conf/powerpc-ieee1275.rmk (kernel_img_SOURCES): Likewise.
	* conf/sparc64-ieee1275.rmk (kernel_img_SOURCES): Likewise.
	(grub_setup_SOURCES): Likewise.
	* conf/tests.rmk (example_unit_test_SOURCES): Likewise.

	* conf/i386-coreboot.rmk (kernel_img_SOURCES): Add
	kern/i386/string.c.
	* conf/i386-ieee1275.rmk (kernel_img_SOURCES): Likewise.
	* conf/i386-multiboot.rmk (kernel_img_SOURCES): Likewise.
	* conf/i386-pc.rmk (kernel_img_SOURCES): Likewise.
	(grub_setup_SOURCES): Likewise.
	* conf/i386-qemu.rmk (kernel_img_SOURCES): Likewise.
	* conf/x86-efi.rmk (kernel_img_SOURCES): Likewise.

	* kern/i386/string.c: New file.
	* kern/misc.c (grub_memset): Move to ...
	* kern/string.c (grub_memset): ... here.
	* kern/misc.c (memset): Move to ...
	* kern/string.c (memset): ... here.

=== modified file 'conf/any-emu.rmk'
--- conf/any-emu.rmk	2010-06-11 20:31:16 +0000
+++ conf/any-emu.rmk	2010-06-23 16:55:35 +0000
@@ -6,7 +6,7 @@ kernel_img_SOURCES = kern/device.c kern/
 	kern/err.c kern/list.c kern/command.c		\
 	kern/corecmd.c kern/file.c kern/fs.c kern/main.c kern/misc.c	\
 	kern/parser.c kern/partition.c kern/term.c			\
-	kern/rescue_reader.c kern/rescue_parser.c			\
+	kern/rescue_reader.c kern/rescue_parser.c kern/string.c		\
 									\
 	kern/emu/main.c kern/emu/mm.c kern/emu/misc.c			\
 	kern/emu/getroot.c kern/emu/time.c kern/emu/hostdisk.c		\

=== modified file 'conf/common.rmk'
--- conf/common.rmk	2010-06-21 15:04:30 +0000
+++ conf/common.rmk	2010-06-23 21:04:23 +0000
@@ -7,7 +7,8 @@ sbin_UTILITIES += grub-mkdevicemap
 grub_mkdevicemap_SOURCES = gnulib/progname.c util/grub-mkdevicemap.c \
 	util/deviceiter.c \
 	util/misc.c kern/emu/misc.c \
-	kern/env.c kern/err.c kern/list.c kern/misc.c kern/emu/mm.c
+	kern/env.c kern/err.c kern/list.c kern/misc.c kern/string.c \
+	kern/emu/mm.c
 
 ifeq ($(target_cpu)-$(platform), sparc64-ieee1275)
 grub_mkdevicemap_SOURCES += util/ieee1275/ofpath.c util/ieee1275/devicemap.c
@@ -27,7 +28,7 @@ util/grub-probe.c_DEPENDENCIES = grub_pr
 grub_probe_SOURCES = gnulib/progname.c util/grub-probe.c	\
 	kern/emu/hostdisk.c util/misc.c kern/emu/misc.c kern/emu/getroot.c kern/emu/mm.c	\
 	kern/device.c kern/disk.c kern/err.c kern/misc.c	\
-	kern/partition.c kern/file.c kern/list.c	\
+	kern/partition.c kern/file.c kern/list.c kern/string.c	\
 	\
 	fs/affs.c fs/cpio.c fs/fat.c fs/ext2.c fs/hfs.c		\
 	fs/hfsplus.c fs/iso9660.c fs/udf.c fs/jfs.c fs/minix.c	\
@@ -49,6 +50,7 @@ util/grub-fstest.c_DEPENDENCIES = grub_f
 grub_fstest_SOURCES = gnulib/progname.c util/grub-fstest.c kern/emu/hostfs.c \
 	util/misc.c kern/emu/misc.c kern/emu/mm.c 	\
 	kern/file.c kern/device.c kern/disk.c kern/err.c kern/misc.c	\
+	kern/string.c							\
 	disk/host.c disk/loopback.c kern/list.c kern/command.c		\
 	lib/arg.c commands/extcmd.c normal/datetime.c normal/misc.c	\
 	lib/hexdump.c lib/crc.c commands/blocklist.c commands/ls.c 	\
@@ -92,7 +94,7 @@ grub_script_check_SOURCES = gnulib/progn
 	util/grub-script-check.c util/misc.c kern/emu/misc.c kern/emu/mm.c \
 	script/main.c script/script.c script/function.c script/lexer.c \
 	kern/err.c kern/list.c \
-	kern/misc.c kern/env.c grub_script.tab.c \
+	kern/misc.c kern/env.c kern/string.c grub_script.tab.c \
 	grub_script.yy.c
 grub_script_check_CFLAGS = $(GNULIB_UTIL_CFLAGS)
 grub_script_check_DEPENDENCIES = grub_script.tab.h
@@ -160,7 +162,7 @@ DISTCLEANFILES += grub_fstest_init.c
 
 # for grub-editenv
 bin_UTILITIES += grub-editenv
-grub_editenv_SOURCES = gnulib/progname.c util/grub-editenv.c lib/envblk.c util/misc.c kern/emu/misc.c kern/emu/mm.c kern/misc.c kern/err.c
+grub_editenv_SOURCES = gnulib/progname.c util/grub-editenv.c lib/envblk.c util/misc.c kern/emu/misc.c kern/emu/mm.c kern/misc.c kern/err.c kern/string.c
 CLEANFILES += grub-editenv
 
 # Needed for genmk.rb to work

=== modified file 'conf/i386-coreboot.rmk'
--- conf/i386-coreboot.rmk	2010-06-11 20:31:16 +0000
+++ conf/i386-coreboot.rmk	2010-06-23 16:58:20 +0000
@@ -16,7 +16,7 @@ kernel_img_SOURCES = kern/i386/coreboot/
 	kern/rescue_parser.c kern/rescue_reader.c \
 	kern/time.c kern/list.c  kern/command.c kern/corecmd.c \
 	kern/$(target_cpu)/dl.c kern/parser.c kern/partition.c \
-	kern/i386/tsc.c kern/i386/pit.c \
+	kern/i386/tsc.c kern/i386/pit.c kern/i386/string.c \
 	kern/generic/rtc_get_time_ms.c \
 	kern/generic/millisleep.c \
 	kern/env.c \

=== modified file 'conf/i386-ieee1275.rmk'
--- conf/i386-ieee1275.rmk	2010-06-11 20:31:16 +0000
+++ conf/i386-ieee1275.rmk	2010-06-23 16:58:48 +0000
@@ -19,6 +19,7 @@ kernel_img_SOURCES = kern/i386/ieee1275/
 	kern/$(target_cpu)/dl.c kern/parser.c kern/partition.c \
 	kern/env.c \
 	kern/time.c kern/list.c  kern/command.c kern/corecmd.c \
+	kern/i386/string.c \
 	kern/generic/millisleep.c \
 	kern/ieee1275/ieee1275.c \
 	term/ieee1275/ofconsole.c \

=== modified file 'conf/i386-multiboot.rmk'
--- conf/i386-multiboot.rmk	2010-05-01 12:06:53 +0000
+++ conf/i386-multiboot.rmk	2010-06-23 16:58:56 +0000
@@ -18,7 +18,7 @@ kernel_img_SOURCES = kern/i386/coreboot/
 	kern/rescue_parser.c kern/rescue_reader.c \
 	kern/time.c kern/list.c kern/handler.c kern/command.c kern/corecmd.c \
 	kern/$(target_cpu)/dl.c kern/parser.c kern/partition.c \
-	kern/i386/tsc.c kern/i386/pit.c \
+	kern/i386/tsc.c kern/i386/pit.c kern/i386/string.c \
 	kern/generic/rtc_get_time_ms.c \
 	kern/generic/millisleep.c \
 	kern/env.c \

=== modified file 'conf/i386-pc.rmk'
--- conf/i386-pc.rmk	2010-06-12 11:17:28 +0000
+++ conf/i386-pc.rmk	2010-06-23 21:04:23 +0000
@@ -46,7 +46,7 @@ kernel_img_SOURCES = kern/i386/pc/startu
 	kern/time.c kern/list.c kern/command.c kern/corecmd.c \
 	kern/$(target_cpu)/dl.c kern/i386/pc/init.c kern/i386/pc/mmap.c \
 	kern/parser.c kern/partition.c \
-	kern/i386/tsc.c kern/i386/pit.c \
+	kern/i386/tsc.c kern/i386/pit.c kern/i386/string.c \
 	kern/generic/rtc_get_time_ms.c \
 	kern/generic/millisleep.c \
 	kern/env.c \
@@ -66,7 +66,7 @@ util/i386/pc/grub-setup.c_DEPENDENCIES =
 grub_setup_SOURCES = gnulib/progname.c util/i386/pc/grub-setup.c	\
 	util/misc.c kern/emu/misc.c kern/emu/getroot.c			\
 	kern/emu/hostdisk.c kern/device.c kern/disk.c kern/err.c	\
-	kern/misc.c kern/partition.c kern/file.c			\
+	kern/misc.c kern/partition.c kern/file.c kern/i386/string.c	\
 	kern/emu/mm.c kern/fs.c kern/env.c kern/list.c fs/fshelp.c	\
 									\
 	fs/affs.c fs/cpio.c fs/ext2.c fs/fat.c fs/hfs.c			\

=== modified file 'conf/i386-qemu.rmk'
--- conf/i386-qemu.rmk	2010-06-11 20:31:16 +0000
+++ conf/i386-qemu.rmk	2010-06-23 16:59:14 +0000
@@ -25,7 +25,7 @@ kernel_img_SOURCES = kern/i386/qemu/star
 	kern/rescue_parser.c kern/rescue_reader.c \
 	kern/time.c kern/list.c  kern/command.c kern/corecmd.c \
 	kern/$(target_cpu)/dl.c kern/parser.c kern/partition.c \
-	kern/i386/tsc.c kern/i386/pit.c \
+	kern/i386/tsc.c kern/i386/pit.c kern/i386/string.c \
 	kern/generic/rtc_get_time_ms.c \
 	kern/generic/millisleep.c \
 	kern/env.c \

=== modified file 'conf/mips-qemu-mips.rmk'
--- conf/mips-qemu-mips.rmk	2010-06-11 20:31:16 +0000
+++ conf/mips-qemu-mips.rmk	2010-06-23 16:59:28 +0000
@@ -11,7 +11,7 @@ kernel_img_SOURCES = kern/$(target_cpu)/
 	kern/$(target_cpu)/$(target_machine)/init.c 		\
 	kern/disk.c kern/dl.c kern/err.c kern/file.c kern/fs.c 		\
 	kern/misc.c kern/mm.c kern/term.c 	\
-	kern/rescue_parser.c kern/rescue_reader.c \
+	kern/rescue_parser.c kern/rescue_reader.c kern/string.c \
 	kern/list.c  kern/command.c kern/corecmd.c	\
 	kern/parser.c kern/partition.c kern/env.c kern/$(target_cpu)/dl.c 	\
 	kern/generic/millisleep.c kern/generic/rtc_get_time_ms.c kern/time.c    \

=== modified file 'conf/mips-yeeloong.rmk'
--- conf/mips-yeeloong.rmk	2010-06-11 20:31:16 +0000
+++ conf/mips-yeeloong.rmk	2010-06-23 16:59:37 +0000
@@ -15,7 +15,7 @@ kernel_img_SOURCES = kern/$(target_cpu)/
 	kern/$(target_cpu)/$(target_machine)/init.c 		\
 	kern/disk.c kern/dl.c kern/err.c kern/file.c kern/fs.c 		\
 	kern/misc.c kern/mm.c kern/term.c 	\
-	kern/rescue_parser.c kern/rescue_reader.c \
+	kern/rescue_parser.c kern/rescue_reader.c kern/string.c \
 	kern/list.c  kern/command.c kern/corecmd.c	\
 	kern/parser.c kern/partition.c kern/env.c kern/$(target_cpu)/dl.c 	\
 	kern/generic/millisleep.c kern/generic/rtc_get_time_ms.c kern/time.c    \

=== modified file 'conf/powerpc-ieee1275.rmk'
--- conf/powerpc-ieee1275.rmk	2010-06-11 20:31:16 +0000
+++ conf/powerpc-ieee1275.rmk	2010-06-23 16:59:58 +0000
@@ -12,7 +12,7 @@ kernel_img_SOURCES = kern/powerpc/ieee12
 	kern/ieee1275/ieee1275.c kern/main.c kern/device.c 		\
 	kern/disk.c kern/dl.c kern/err.c kern/file.c kern/fs.c 		\
 	kern/misc.c kern/mm.c kern/term.c 	\
-	kern/rescue_parser.c kern/rescue_reader.c \
+	kern/rescue_parser.c kern/rescue_reader.c kern/string.c \
 	kern/list.c  kern/command.c kern/corecmd.c	\
 	kern/ieee1275/init.c 						\
 	kern/ieee1275/mmap.c						\

=== modified file 'conf/sparc64-ieee1275.rmk'
--- conf/sparc64-ieee1275.rmk	2010-06-11 20:31:16 +0000
+++ conf/sparc64-ieee1275.rmk	2010-06-23 17:00:18 +0000
@@ -25,7 +25,7 @@ kernel_img_SOURCES = kern/sparc64/ieee12
 	kern/ieee1275/ieee1275.c kern/main.c kern/device.c		\
 	kern/disk.c kern/dl.c kern/err.c kern/file.c kern/fs.c		\
 	kern/misc.c kern/mm.c kern/term.c			\
-	kern/rescue_parser.c kern/rescue_reader.c \
+	kern/rescue_parser.c kern/rescue_reader.c kern/string.c \
 	kern/list.c  kern/command.c kern/corecmd.c	\
 	kern/sparc64/ieee1275/ieee1275.c 				\
 	kern/sparc64/ieee1275/init.c 					\
@@ -48,7 +48,7 @@ util/sparc64/ieee1275/grub-setup.c_DEPEN
 grub_setup_SOURCES = util/sparc64/ieee1275/grub-setup.c			\
 	util/ieee1275/ofpath.c util/misc.c kern/emu/hostdisk.c		\
 	kern/emu/misc.c kern/emu/getroot.c kern/emu/mm.c kern/device.c	\
-	kern/disk.c kern/err.c kern/misc.c 				\
+	kern/disk.c kern/err.c kern/misc.c kern/string.c 		\
 	kern/partition.c kern/file.c kern/fs.c kern/env.c kern/list.c	\
 	fs/fshelp.c							\
 									\

=== modified file 'conf/tests.rmk'
--- conf/tests.rmk	2010-04-30 08:20:41 +0000
+++ conf/tests.rmk	2010-06-23 17:00:25 +0000
@@ -21,7 +21,7 @@ functional_test_mod_LDFLAGS = $(COMMON_L
 
 # Rules for unit tests
 check_UTILITIES += example_unit_test
-example_unit_test_SOURCES = tests/example_unit_test.c kern/list.c kern/misc.c tests/lib/test.c tests/lib/unit_test.c
+example_unit_test_SOURCES = tests/example_unit_test.c kern/list.c kern/misc.c kern/string.c tests/lib/test.c tests/lib/unit_test.c
 example_unit_test_CFLAGS  = -Wno-format
 
 # Rules for functional tests

=== modified file 'conf/x86-efi.rmk'
--- conf/x86-efi.rmk	2010-06-12 11:17:28 +0000
+++ conf/x86-efi.rmk	2010-06-23 21:04:23 +0000
@@ -26,7 +26,7 @@ kernel_img_SOURCES = kern/$(target_cpu)/
 	kern/env.c symlist.c kern/efi/efi.c kern/efi/init.c kern/efi/mm.c \
 	term/efi/console.c disk/efi/efidisk.c \
 	kern/time.c kern/list.c  kern/command.c kern/corecmd.c \
-	kern/i386/tsc.c kern/i386/pit.c \
+	kern/i386/tsc.c kern/i386/pit.c kern/i386/string.c \
 	kern/generic/rtc_get_time_ms.c \
 	kern/generic/millisleep.c
 ifeq ($(target_cpu),x86_64)

=== added file 'kern/i386/string.c'
--- kern/i386/string.c	1970-01-01 00:00:00 +0000
+++ kern/i386/string.c	2010-06-23 17:03:10 +0000
@@ -0,0 +1,96 @@
+/*
+ *  GRUB  --  GRand Unified Bootloader
+ *  Copyright (C) 2002,2003,2004,2005,2006,2007,2008,2009,2010  Free Software Foundation, Inc.
+ *
+ *  This optimised memset implementation originates in GNU libc.  Its
+ *  copyright and licensing information follow.
+ *
+ *  Set a block of memory to some byte value.
+ *  For Intel 80x86, x>=3.
+ *  Copyright (C) 1991,1992,1993,1997,1998,2003, 2005 Free Software Foundation, Inc.
+ *  This file is part of the GNU C Library.
+ *  Contributed by Torbjorn Granlund (tege@sics.se).
+ *
+ *  The GNU C Library is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ *
+ *  The GNU C Library is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  Lesser General Public License for more details.
+ *
+ *  You should have received a copy of the GNU Lesser General Public
+ *  License along with the GNU C Library; if not, write to the Free
+ *  Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
+ *  02111-1307 USA.
+ */
+
+#include <grub/misc.h>
+
+#define OPSIZ (sizeof (unsigned long int))
+
+void *
+grub_memset (void *dstpp, int c, grub_size_t len)
+{
+  int d0;
+  unsigned long int dstp = (unsigned long int) dstpp;
+
+  /* This explicit register allocation
+     improves code very much indeed.  */
+  register unsigned long int x asm("ax");
+
+  x = (unsigned char) c;
+
+  /* Clear the direction flag, so filling will move forward.  */
+  asm volatile("cld");
+
+  /* This threshold value is optimal.  */
+  if (len >= 12)
+    {
+      /* Fill X with four copies of the char we want to fill with.  */
+      x |= (x << 8);
+      x |= (x << 16);
+
+      /* Adjust LEN for the bytes handled in the first loop.  */
+      len -= (-dstp) % OPSIZ;
+
+      /* There are at least some bytes to set.
+         No need to test for LEN == 0 in this alignment loop.  */
+
+      /* Fill bytes until DSTP is aligned on a longword boundary.  */
+      asm volatile("rep\n"
+                   "stosb" /* %0, %2, %3 */ :
+                   "=D" (dstp), "=c" (d0) :
+                   "0" (dstp), "1" ((-dstp) % OPSIZ), "a" (x) :
+                   "memory");
+
+      /* Fill longwords.  */
+      asm volatile("rep\n"
+                   "stosl" /* %0, %2, %3 */ :
+                   "=D" (dstp), "=c" (d0) :
+                   "0" (dstp), "1" (len / OPSIZ), "a" (x) :
+                   "memory");
+      len %= OPSIZ;
+    }
+
+  /* Write the last few bytes.  */
+  asm volatile("rep\n"
+               "stosb" /* %0, %2, %3 */ :
+               "=D" (dstp), "=c" (d0) :
+               "0" (dstp), "1" (len), "a" (x) :
+               "memory");
+
+  return dstpp;
+}
+
+#ifndef APPLE_CC
+void *memset (void *s, int c, grub_size_t n)
+  __attribute__ ((alias ("grub_memset")));
+#else
+void *memset (void *s, int c, grub_size_t n)
+{
+  return grub_memset (s, c, n);
+}
+#endif

=== modified file 'kern/misc.c'
--- kern/misc.c	2010-05-28 13:48:45 +0000
+++ kern/misc.c	2010-06-23 17:01:49 +0000
@@ -506,26 +506,6 @@ grub_strndup (const char *s, grub_size_t
   return p;
 }
 
-void *
-grub_memset (void *s, int c, grub_size_t n)
-{
-  unsigned char *p = (unsigned char *) s;
-
-  while (n--)
-    *p++ = (unsigned char) c;
-
-  return s;
-}
-#ifndef APPLE_CC
-void *memset (void *s, int c, grub_size_t n)
-  __attribute__ ((alias ("grub_memset")));
-#else
-void *memset (void *s, int c, grub_size_t n)
-{
-  return grub_memset (s, c, n);
-}
-#endif
-
 grub_size_t
 grub_strlen (const char *s)
 {

=== added file 'kern/string.c'
--- kern/string.c	1970-01-01 00:00:00 +0000
+++ kern/string.c	2010-06-23 17:01:42 +0000
@@ -0,0 +1,42 @@
+/* string.c - definitions of string functions with architecture-specific
+   optimisations */
+/*
+ *  GRUB  --  GRand Unified Bootloader
+ *  Copyright (C) 1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010  Free Software Foundation, Inc.
+ *
+ *  GRUB is free software: you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation, either version 3 of the License, or
+ *  (at your option) any later version.
+ *
+ *  GRUB is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with GRUB.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <grub/misc.h>
+
+void *
+grub_memset (void *s, int c, grub_size_t n)
+{
+  unsigned char *p = (unsigned char *) s;
+
+  while (n--)
+    *p++ = (unsigned char) c;
+
+  return s;
+}
+
+#ifndef APPLE_CC
+void *memset (void *s, int c, grub_size_t n)
+  __attribute__ ((alias ("grub_memset")));
+#else
+void *memset (void *s, int c, grub_size_t n)
+{
+  return grub_memset (s, c, n);
+}
+#endif

Thanks,

-- 
Colin Watson                                       [cjwatson@ubuntu.com]


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimise memset on i386
  2010-06-23 21:38 [PATCH] Optimise memset on i386 Colin Watson
@ 2010-06-25 18:04 ` Vladimir 'φ-coder/phcoder' Serbinenko
  2010-06-25 18:27 ` Vladimir 'φ-coder/phcoder' Serbinenko
  1 sibling, 0 replies; 10+ messages in thread
From: Vladimir 'φ-coder/phcoder' Serbinenko @ 2010-06-25 18:04 UTC (permalink / raw)
  To: grub-devel

[-- Attachment #1: Type: text/plain, Size: 3201 bytes --]

On 06/23/2010 11:38 PM, Colin Watson wrote:
> With this approach, one of the most noticeable time sinks is that
> setting a graphical video mode (I'm using the VBE backend) takes ages:
> 1.6 seconds, which is a substantial percentage of this project's total
> boot time.  It turns out that most of this is spent initialising
> double-buffering: doublebuf_pageflipping_init calls
> grub_video_fb_create_render_target_from_pointer twice, and each call
> takes a little over 600 milliseconds.  Now,
> grub_video_fb_create_render_target_from_pointer is basically just a big
> grub_memset to clear framebuffer memory, so this equates to under two
> frames per second.  What's going on?
>
> It turns out that write caching is disabled on video memory when GRUB is
> running, so we take a cache stall on every single write, and it's
> apparently hard to enable caching without implementing MTRRs.  People
> who know more about this than I do tell me that this can get
> unpleasantly CPU-specific at times, although I still hold out some hope
> that it's possible in GRUB.
>
>   
On non-device memory GRUB should take advantage of cache. On MIPS
enabling/disabling cache is done by using a different address. So we
have all infrastructure necessary for differentiating
cacheable/non-cacheable is present. Enabling cache on video memory is
however more of a trouble. One of the reasons is that cache nmishandling
produces difficult bugs.
> However, there's a way to substantially speed things up without that.
> The naïve implementation of grub_memset writes a byte at a time, and for
> that matter on i386 it compiles to a poorly-optimised loop rather than
> using REP STOS or similar.  grub_memset is an inner loop practically by
> definition, and it's worth optimising.  We can fix both of these
> weaknesses by importing the optimised memset from GNU libc: since it
> writes four bytes at a time except (sometimes) at the start and end, it
> should take about a quarter the number of cache stalls.  And, indeed,
> measurement bears this out: instead of taking over 600 milliseconds per
> call to grub_video_fb_create_render_target_from_pointer (I think it was
> actually 630 or so, though I neglected to write that down), GRUB now
> takes about 160 milliseconds per call.  Much better!
>
> The optimised memset is LGPLv2.1 or later, and I've preserved that
> notice, but as far as I know this should be fine for use in GRUB; it can
> be upgraded to LGPLv3, and that's just GPLv3 with some additional
> permissions.  It's already assigned to the FSF due to being in glibc.
>
>   
It's ok to use this code but be sure to mention its origin. It's also ok
to keep its license unless big divergeance is to be expected.

Did you test it on x86_64?
> +void *
> +grub_memset (void *s, int c, grub_size_t n)
> +{
> +  unsigned char *p = (unsigned char *) s;
> +
> +  while (n--)
> +    *p++ = (unsigned char) c;
> +
> +  return s;
> +}
>   
This can be optimised the same way as i386 part, just replace stos with
a loop over iterator with a pointer aligned on its size.
> Thanks,
>
>   


-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 294 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimise memset on i386
  2010-06-23 21:38 [PATCH] Optimise memset on i386 Colin Watson
  2010-06-25 18:04 ` Vladimir 'φ-coder/phcoder' Serbinenko
@ 2010-06-25 18:27 ` Vladimir 'φ-coder/phcoder' Serbinenko
  2010-07-23 11:14   ` Colin Watson
  1 sibling, 1 reply; 10+ messages in thread
From: Vladimir 'φ-coder/phcoder' Serbinenko @ 2010-06-25 18:27 UTC (permalink / raw)
  To: grub-devel


[-- Attachment #1.1: Type: text/plain, Size: 793 bytes --]

>
> +void *
> +grub_memset (void *s, int c, grub_size_t n)
> +{
> +  unsigned char *p = (unsigned char *) s;
> +
> +  while (n--)
> +    *p++ = (unsigned char) c;
> +
> +  return s;
> +}
>   
Attached is a possible generic implementation. Not even compile-tested.
Could you test it and compare disasm of this version on i386 and glibc
i386 version. Perhaps they are equivalent. If they are for maintainance
reasons it's better to always use generic one.
> +
> +#ifndef APPLE_CC
> +void *memset (void *s, int c, grub_size_t n)
> +  __attribute__ ((alias ("grub_memset")));
> +#else
> +void *memset (void *s, int c, grub_size_t n)
> +{
> +  return grub_memset (s, c, n);
> +}
> +#endif
>
> Thanks,
>
>   


-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: memset.c --]
[-- Type: text/x-csrc; name="memset.c", Size: 684 bytes --]


void
grub_memset (void *p, int c, grub_size_t len)
{
  grub_uint8_t pattern8 = c;
  if (len >= 3 * sizeof (unsigned long))
    {
      unsigned long patternl = 0;
      int i;
      for (i = 0; i < sizeof (unsigned long); i++)
	patternl |= ((unsigned long) pattern8) << (8 * i);
      while (len > 0 && (((grub_addr_t) p) & (sizeof (unsigned long) - 1)))
	{
	  *(grub_uint8_t *) p = pattern8;
	  p = (grub_uint8_t *) p + 1;
	}
      while (len >= sizeof (unsigned long))
	{
	  *(unsigned long *) p = patternl;
	  p = (unsigned long *) p + 1;
	}
    }
  while (len > 0)
    {
      *(grub_uint8_t *) p = pattern8;
      p = (grub_uint8_t *) p + 1;
    }
}

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 294 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimise memset on i386
  2010-06-25 18:27 ` Vladimir 'φ-coder/phcoder' Serbinenko
@ 2010-07-23 11:14   ` Colin Watson
  2010-07-23 15:56     ` richardvoigt
  2010-07-28 14:21     ` Vladimir 'φ-coder/phcoder' Serbinenko
  0 siblings, 2 replies; 10+ messages in thread
From: Colin Watson @ 2010-07-23 11:14 UTC (permalink / raw)
  To: The development of GNU GRUB

On Fri, Jun 25, 2010 at 08:27:20PM +0200, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
> > +void *
> > +grub_memset (void *s, int c, grub_size_t n)
> > +{
> > +  unsigned char *p = (unsigned char *) s;
> > +
> > +  while (n--)
> > +    *p++ = (unsigned char) c;
> > +
> > +  return s;
> > +}
> 
> Attached is a possible generic implementation. Not even compile-tested.
> Could you test it and compare disasm of this version on i386 and glibc
> i386 version. Perhaps they are equivalent. If they are for maintainance
> reasons it's better to always use generic one.

Thanks for this, and sorry for my delay.

I haven't compared the disassembly, but once I fixed up your code it
performed within about 10% of the optimised version I posted before, so
about 640ms original, 160ms with i386-specific memset, and 176ms with
your generic memset.  Even though that isn't quite equivalent, I don't
think that 16ms is going to be a major problem, and so I would suggest
going with the generic version.  Please check the following.

2010-07-23  Vladimir Serbinenko  <phcoder@gmail.com>
2010-07-23  Colin Watson  <cjwatson@ubuntu.com>

	* kern/misc.c (grub_memset): Optimise to reduce cache stalls.

=== modified file 'kern/misc.c'
--- kern/misc.c	2010-07-02 17:35:07 +0000
+++ kern/misc.c	2010-07-23 11:05:32 +0000
@@ -518,12 +518,39 @@ grub_strndup (const char *s, grub_size_t
 }
 
 void *
-grub_memset (void *s, int c, grub_size_t n)
+grub_memset (void *s, int c, grub_size_t len)
 {
-  unsigned char *p = (unsigned char *) s;
+  void *p = s;
+  grub_uint8_t pattern8 = c;
 
-  while (n--)
-    *p++ = (unsigned char) c;
+  if (len >= 3 * sizeof (unsigned long))
+    {
+      unsigned long patternl = 0;
+      grub_size_t i;
+
+      for (i = 0; i < sizeof (unsigned long); i++)
+	patternl |= ((unsigned long) pattern8) << (8 * i);
+
+      while (len > 0 && (((grub_addr_t) p) & (sizeof (unsigned long) - 1)))
+	{
+	  *(grub_uint8_t *) p = pattern8;
+	  p = (grub_uint8_t *) p + 1;
+	  len--;
+	}
+      while (len >= sizeof (unsigned long))
+	{
+	  *(unsigned long *) p = patternl;
+	  p = (unsigned long *) p + 1;
+	  len -= sizeof (unsigned long);
+	}
+    }
+
+  while (len > 0)
+    {
+      *(grub_uint8_t *) p = pattern8;
+      p = (grub_uint8_t *) p + 1;
+      len--;
+    }
 
   return s;
 }

-- 
Colin Watson                                       [cjwatson@ubuntu.com]


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimise memset on i386
  2010-07-23 11:14   ` Colin Watson
@ 2010-07-23 15:56     ` richardvoigt
  2010-07-23 17:34       ` Christian Franke
  2010-07-24 22:40       ` Colin Watson
  2010-07-28 14:21     ` Vladimir 'φ-coder/phcoder' Serbinenko
  1 sibling, 2 replies; 10+ messages in thread
From: richardvoigt @ 2010-07-23 15:56 UTC (permalink / raw)
  To: The development of GNU GRUB

[-- Attachment #1: Type: text/plain, Size: 514 bytes --]

[snip]

> +      unsigned long patternl = 0;
> +      grub_size_t i;
> +
> +      for (i = 0; i < sizeof (unsigned long); i++)
> +       patternl |= ((unsigned long) pattern8) << (8 * i);
> +
>

might I suggest:

unsigned long patternl = pattern8;
patternl |= patternl << 8;
patternl |= patternl << 16;
patternl |= patternl << 32;
patternl |= patternl << 64;

O(lg N) instead of O(N), no loop, no branches, and the compiler should be
smart enough to optimize away the last two lines on systems with narrower
long.

[-- Attachment #2: Type: text/html, Size: 853 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimise memset on i386
  2010-07-23 15:56     ` richardvoigt
@ 2010-07-23 17:34       ` Christian Franke
  2010-07-23 19:48         ` richardvoigt
  2010-07-24 22:40       ` Colin Watson
  1 sibling, 1 reply; 10+ messages in thread
From: Christian Franke @ 2010-07-23 17:34 UTC (permalink / raw)
  To: The development of GNU GRUB

richardvoigt wrote:
>
> might I suggest:
>
> unsigned long patternl = pattern8;
> patternl |= patternl << 8;
> patternl |= patternl << 16;
> patternl |= patternl << 32;
> patternl |= patternl << 64;
>
> O(lg N) instead of O(N), no loop, no branches, and the compiler should 
> be smart enough to optimize away the last two lines on systems with 
> narrower long.

The latter is unfortunately not the case. At least gcc 4.5.0 prints a 
warning but still produces code.

$ cat <<EOF >f.c
unsigned long f(unsigned long x)
{
   x |= x << 32;
   x |= x << 64;
   return x;
}

$ gcc -O3 -S f.c
x.c: In function ‘f’:
x.c:3: warning: left shift count >= width of type
x.c:4: warning: left shift count >= width of type

$ cat f.s
...
         pushl   %ebp
         movl    $32, %ecx
         movl    %esp, %ebp
         movl    8(%ebp), %eax
         popl    %ebp
         movl    %eax, %edx
         sall    %cl, %edx
         movl    $64, %ecx
         orl     %eax, %edx
         movl    %edx, %eax
         sall    %cl, %eax
         orl     %edx, %eax
         ret


-- 
Regards,
Christian Franke



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimise memset on i386
  2010-07-23 17:34       ` Christian Franke
@ 2010-07-23 19:48         ` richardvoigt
  2010-07-23 20:09           ` richardvoigt
  0 siblings, 1 reply; 10+ messages in thread
From: richardvoigt @ 2010-07-23 19:48 UTC (permalink / raw)
  To: The development of GNU GRUB

[-- Attachment #1: Type: text/plain, Size: 961 bytes --]

On Fri, Jul 23, 2010 at 12:34 PM, Christian Franke <
Christian.Franke@t-online.de> wrote:

> richardvoigt wrote:
>
>>
>> might I suggest:
>>
>> unsigned long patternl = pattern8;
>> patternl |= patternl << 8;
>> patternl |= patternl << 16;
>> patternl |= patternl << 32;
>> patternl |= patternl << 64;
>>
>> O(lg N) instead of O(N), no loop, no branches, and the compiler should be
>> smart enough to optimize away the last two lines on systems with narrower
>> long.
>>
>
> The latter is unfortunately not the case. At least gcc 4.5.0 prints a
> warning but still produces code.
>
>
Ok then, a compile-time conditional should fix that.

unsigned long patternl = pattern8;
if (8 < sizeof patternl) patternl |= patternl << 8;
if (16 < sizeof patternl) patternl |= patternl << 16;
if (32 < sizeof patternl) patternl |= patternl << 32;
if (64 < sizeof patternl) patternl |= patternl << 64;

I'm pretty confident in gcc's ability to optimize this version.  I hope.

[-- Attachment #2: Type: text/html, Size: 1518 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimise memset on i386
  2010-07-23 19:48         ` richardvoigt
@ 2010-07-23 20:09           ` richardvoigt
  0 siblings, 0 replies; 10+ messages in thread
From: richardvoigt @ 2010-07-23 20:09 UTC (permalink / raw)
  To: The development of GNU GRUB

[-- Attachment #1: Type: text/plain, Size: 1138 bytes --]

On Fri, Jul 23, 2010 at 2:48 PM, richardvoigt@gmail.com <
richardvoigt@gmail.com> wrote:

>
>
> On Fri, Jul 23, 2010 at 12:34 PM, Christian Franke <
> Christian.Franke@t-online.de> wrote:
>
>> richardvoigt wrote:
>>
>>>
>>> might I suggest:
>>>
>>> unsigned long patternl = pattern8;
>>> patternl |= patternl << 8;
>>> patternl |= patternl << 16;
>>> patternl |= patternl << 32;
>>> patternl |= patternl << 64;
>>>
>>> O(lg N) instead of O(N), no loop, no branches, and the compiler should be
>>> smart enough to optimize away the last two lines on systems with narrower
>>> long.
>>>
>>
>> The latter is unfortunately not the case. At least gcc 4.5.0 prints a
>> warning but still produces code.
>>
>>
> Ok then, a compile-time conditional should fix that.
>
>

Thanks to Seth for pointing out my obvious bug.

unsigned long patternl = pattern8;
if (1 < sizeof patternl) patternl |= patternl << 8;
if (2 < sizeof patternl) patternl |= patternl << 16;
if (4 < sizeof patternl) patternl |= patternl << 32;
if (8 < sizeof patternl) patternl |= patternl << 64;



> I'm pretty confident in gcc's ability to optimize this version.  I hope.
>

[-- Attachment #2: Type: text/html, Size: 2173 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimise memset on i386
  2010-07-23 15:56     ` richardvoigt
  2010-07-23 17:34       ` Christian Franke
@ 2010-07-24 22:40       ` Colin Watson
  1 sibling, 0 replies; 10+ messages in thread
From: Colin Watson @ 2010-07-24 22:40 UTC (permalink / raw)
  To: The development of GNU GRUB

On Fri, Jul 23, 2010 at 10:56:24AM -0500, richardvoigt@gmail.com wrote:
> [snip]
> 
> > +      unsigned long patternl = 0;
> > +      grub_size_t i;
> > +
> > +      for (i = 0; i < sizeof (unsigned long); i++)
> > +       patternl |= ((unsigned long) pattern8) << (8 * i);
> > +
> 
> might I suggest:
> 
> unsigned long patternl = pattern8;
> patternl |= patternl << 8;
> patternl |= patternl << 16;
> patternl |= patternl << 32;
> patternl |= patternl << 64;
> 
> O(lg N) instead of O(N), no loop, no branches, and the compiler should be
> smart enough to optimize away the last two lines on systems with narrower
> long.

I no longer have the system on which I benchmarked this.  However, since
N is always either 4 or 8 on current targets, this can only amount to
micro-optimisation which I don't think can possibly matter much; we're
talking a handful of cycles at most.  Do we really need to spend time
bikeshedding this?  The important thing is taking only a cache stall per
long rather than a cache stall per byte; anything else is likely to be
noise.

-- 
Colin Watson                                       [cjwatson@ubuntu.com]


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Optimise memset on i386
  2010-07-23 11:14   ` Colin Watson
  2010-07-23 15:56     ` richardvoigt
@ 2010-07-28 14:21     ` Vladimir 'φ-coder/phcoder' Serbinenko
  1 sibling, 0 replies; 10+ messages in thread
From: Vladimir 'φ-coder/phcoder' Serbinenko @ 2010-07-28 14:21 UTC (permalink / raw)
  To: grub-devel

[-- Attachment #1: Type: text/plain, Size: 2655 bytes --]

On 07/23/2010 02:14 PM, Colin Watson wrote:
> On Fri, Jun 25, 2010 at 08:27:20PM +0200, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
>   
>>> +void *
>>> +grub_memset (void *s, int c, grub_size_t n)
>>> +{
>>> +  unsigned char *p = (unsigned char *) s;
>>> +
>>> +  while (n--)
>>> +    *p++ = (unsigned char) c;
>>> +
>>> +  return s;
>>> +}
>>>       
>> Attached is a possible generic implementation. Not even compile-tested.
>> Could you test it and compare disasm of this version on i386 and glibc
>> i386 version. Perhaps they are equivalent. If they are for maintainance
>> reasons it's better to always use generic one.
>>     
> Thanks for this, and sorry for my delay.
>
> I haven't compared the disassembly, but once I fixed up your code it
> performed within about 10% of the optimised version I posted before, so
> about 640ms original, 160ms with i386-specific memset, and 176ms with
> your generic memset.  Even though that isn't quite equivalent, I don't
> think that 16ms is going to be a major problem, and so I would suggest
> going with the generic version.  Please check the following.
>
> 2010-07-23  Vladimir Serbinenko  <phcoder@gmail.com>
> 2010-07-23  Colin Watson  <cjwatson@ubuntu.com>
>
> 	* kern/misc.c (grub_memset): Optimise to reduce cache stalls.
>
>   
Go ahead.
> === modified file 'kern/misc.c'
> --- kern/misc.c	2010-07-02 17:35:07 +0000
> +++ kern/misc.c	2010-07-23 11:05:32 +0000
> @@ -518,12 +518,39 @@ grub_strndup (const char *s, grub_size_t
>  }
>  
>  void *
> -grub_memset (void *s, int c, grub_size_t n)
> +grub_memset (void *s, int c, grub_size_t len)
>  {
> -  unsigned char *p = (unsigned char *) s;
> +  void *p = s;
> +  grub_uint8_t pattern8 = c;
>  
> -  while (n--)
> -    *p++ = (unsigned char) c;
> +  if (len >= 3 * sizeof (unsigned long))
> +    {
> +      unsigned long patternl = 0;
> +      grub_size_t i;
> +
> +      for (i = 0; i < sizeof (unsigned long); i++)
> +	patternl |= ((unsigned long) pattern8) << (8 * i);
> +
> +      while (len > 0 && (((grub_addr_t) p) & (sizeof (unsigned long) - 1)))
> +	{
> +	  *(grub_uint8_t *) p = pattern8;
> +	  p = (grub_uint8_t *) p + 1;
> +	  len--;
> +	}
> +      while (len >= sizeof (unsigned long))
> +	{
> +	  *(unsigned long *) p = patternl;
> +	  p = (unsigned long *) p + 1;
> +	  len -= sizeof (unsigned long);
> +	}
> +    }
> +
> +  while (len > 0)
> +    {
> +      *(grub_uint8_t *) p = pattern8;
> +      p = (grub_uint8_t *) p + 1;
> +      len--;
> +    }
>  
>    return s;
>  }
>
>   


-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 294 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2010-07-28 14:22 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-23 21:38 [PATCH] Optimise memset on i386 Colin Watson
2010-06-25 18:04 ` Vladimir 'φ-coder/phcoder' Serbinenko
2010-06-25 18:27 ` Vladimir 'φ-coder/phcoder' Serbinenko
2010-07-23 11:14   ` Colin Watson
2010-07-23 15:56     ` richardvoigt
2010-07-23 17:34       ` Christian Franke
2010-07-23 19:48         ` richardvoigt
2010-07-23 20:09           ` richardvoigt
2010-07-24 22:40       ` Colin Watson
2010-07-28 14:21     ` Vladimir 'φ-coder/phcoder' Serbinenko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.