All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Christian König" <christian.koenig@amd.com>
To: Lucas De Marchi <lucas.demarchi@intel.com>,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Cc: Daniel Vetter <daniel.vetter@intel.com>, tzimmermann@suse.de
Subject: Re: [Intel-gfx] [CI 1/2] iosys-map: Add per-word read
Date: Mon, 4 Jul 2022 08:30:08 +0200	[thread overview]
Message-ID: <032a834c-3dbd-e997-9e4c-b3baccec43cb@amd.com> (raw)
In-Reply-To: <20220627224751.3627465-1-lucas.demarchi@intel.com>

Am 28.06.22 um 00:47 schrieb Lucas De Marchi:
> Instead of always falling back to memcpy_fromio() for any size, prefer
> using read{b,w,l}(). When reading struct members it's common to read
> individual integer variables individually. Going through memcpy_fromio()
> for each of them poses a high penalty.
>
> Employ a similar trick as __seqprop() by using _Generic() to generate
> only the specific call based on a type-compatible variable.
>
> For a pariticular i915 workload producing GPU context switches,
> __get_engine_usage_record() is particularly hot since the engine usage
> is read from device local memory with dgfx, possibly multiple times
> since it's racy. Test execution time for this test shows a ~12.5%
> improvement with DG2:
>
> Before:
> 	nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07;
> 	median = 9.52548e+06; var = 526149;
> After:
> 	nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06;
> 	median = 8.33955e+06; var = 333113;
>
> Other things attempted that didn't prove very useful:
> 1) Change the _Generic() on x86 to just dereference the memory address
> 2) Change __get_engine_usage_record() to do just 1 read per loop,
>     comparing with the previous value read
> 3) Change __get_engine_usage_record() to access the fields directly as it
>     was before the conversion to iosys-map
>
> (3) did gave a small improvement (~3%), but doesn't seem to scale well
> to other similar cases in the driver.
>
> Additional test by Chris Wilson using gem_create from igt with some
> changes to track object creation time. This happens to accidentally
> stress this code path:
>
> 	Pre iosys_map conversion of engine busyness:
> 	lmem0: Creating    262144 4KiB objects took 59274.2ms
>
> 	Unpatched:
> 	lmem0: Creating    262144 4KiB objects took 108830.2ms
>
> 	With readl (this patch):
> 	lmem0: Creating    262144 4KiB objects took 61348.6ms
>
> 	s/readl/READ_ONCE/
> 	lmem0: Creating    262144 4KiB objects took 61333.2ms
>
> So we do take a little bit more time than before the conversion, but
> that is due to other factors: bringing the READ_ONCE back would be as
> good as just doing this conversion.
>
> v2:
> - Remove default from _Generic() - callers wanting to read more
>    than u64 should use iosys_map_memcpy_from()
> - Add READ_ONCE() cases dereferencing the pointer when using system
>    memory
>
> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
> Reviewed-by: Christian König <christian.koenig@amd.com> # v1
> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>

Feel free to update my rb to v2 as well.

Apart from that do you have commit rights to drm-misc-next? If not 
should we push this?

Thanks,
Christian.

> ---
>   include/linux/iosys-map.h | 45 +++++++++++++++++++++++++++++++--------
>   1 file changed, 36 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/iosys-map.h b/include/linux/iosys-map.h
> index 4b8406ee8bc4..ec81ed995c59 100644
> --- a/include/linux/iosys-map.h
> +++ b/include/linux/iosys-map.h
> @@ -6,6 +6,7 @@
>   #ifndef __IOSYS_MAP_H__
>   #define __IOSYS_MAP_H__
>   
> +#include <linux/compiler_types.h>
>   #include <linux/io.h>
>   #include <linux/string.h>
>   
> @@ -333,6 +334,26 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
>   		memset(dst->vaddr + offset, value, len);
>   }
>   
> +#ifdef CONFIG_64BIT
> +#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)				\
> +	u64: val_ = readq(vaddr_iomem_)
> +#else
> +#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)				\
> +	u64: memcpy_fromio(&(val_), vaddr_iomem_, sizeof(u64))
> +#endif
> +
> +#define __iosys_map_rd_io(val__, vaddr_iomem__, type__) _Generic(val__,		\
> +	u8: val__ = readb(vaddr_iomem__),					\
> +	u16: val__ = readw(vaddr_iomem__),					\
> +	u32: val__ = readl(vaddr_iomem__),					\
> +	__iosys_map_rd_io_u64_case(val__, vaddr_iomem__))
> +
> +#define __iosys_map_rd_sys(val__, vaddr__, type__) ({				\
> +	compiletime_assert(sizeof(type__) <= sizeof(u64),			\
> +			   "Unsupported access size for __iosys_map_rd_sys()");	\
> +	val__ = READ_ONCE(*((type__ *)vaddr__));				\
> +})
> +
>   /**
>    * iosys_map_rd - Read a C-type value from the iosys_map
>    *
> @@ -340,16 +361,21 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
>    * @offset__:	The offset from which to read
>    * @type__:	Type of the value being read
>    *
> - * Read a C type value from iosys_map, handling possible un-aligned accesses to
> - * the mapping.
> + * Read a C type value (u8, u16, u32 and u64) from iosys_map. For other types or
> + * if pointer may be unaligned (and problematic for the architecture supported),
> + * use iosys_map_memcpy_from().
>    *
>    * Returns:
>    * The value read from the mapping.
>    */
> -#define iosys_map_rd(map__, offset__, type__) ({			\
> -	type__ val;							\
> -	iosys_map_memcpy_from(&val, map__, offset__, sizeof(val));	\
> -	val;								\
> +#define iosys_map_rd(map__, offset__, type__) ({				\
> +	type__ val;								\
> +	if ((map__)->is_iomem) {						\
> +		__iosys_map_rd_io(val, (map__)->vaddr_iomem + (offset__), type__);\
> +	} else {								\
> +		__iosys_map_rd_sys(val, (map__)->vaddr + (offset__), type__);	\
> +	}									\
> +	val;									\
>   })
>   
>   /**
> @@ -379,9 +405,10 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
>    *
>    * Read a value from iosys_map considering its layout is described by a C struct
>    * starting at @struct_offset__. The field offset and size is calculated and its
> - * value read handling possible un-aligned memory accesses. For example: suppose
> - * there is a @struct foo defined as below and the value ``foo.field2.inner2``
> - * needs to be read from the iosys_map:
> + * value read. If the field access would incur in un-aligned access, then either
> + * iosys_map_memcpy_from() needs to be used or the architecture must support it.
> + * For example: suppose there is a @struct foo defined as below and the value
> + * ``foo.field2.inner2`` needs to be read from the iosys_map:
>    *
>    * .. code-block:: c
>    *


WARNING: multiple messages have this Message-ID (diff)
From: "Christian König" <christian.koenig@amd.com>
To: Lucas De Marchi <lucas.demarchi@intel.com>,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Cc: Daniel Vetter <daniel.vetter@intel.com>, tzimmermann@suse.de
Subject: Re: [CI 1/2] iosys-map: Add per-word read
Date: Mon, 4 Jul 2022 08:30:08 +0200	[thread overview]
Message-ID: <032a834c-3dbd-e997-9e4c-b3baccec43cb@amd.com> (raw)
In-Reply-To: <20220627224751.3627465-1-lucas.demarchi@intel.com>

Am 28.06.22 um 00:47 schrieb Lucas De Marchi:
> Instead of always falling back to memcpy_fromio() for any size, prefer
> using read{b,w,l}(). When reading struct members it's common to read
> individual integer variables individually. Going through memcpy_fromio()
> for each of them poses a high penalty.
>
> Employ a similar trick as __seqprop() by using _Generic() to generate
> only the specific call based on a type-compatible variable.
>
> For a pariticular i915 workload producing GPU context switches,
> __get_engine_usage_record() is particularly hot since the engine usage
> is read from device local memory with dgfx, possibly multiple times
> since it's racy. Test execution time for this test shows a ~12.5%
> improvement with DG2:
>
> Before:
> 	nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07;
> 	median = 9.52548e+06; var = 526149;
> After:
> 	nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06;
> 	median = 8.33955e+06; var = 333113;
>
> Other things attempted that didn't prove very useful:
> 1) Change the _Generic() on x86 to just dereference the memory address
> 2) Change __get_engine_usage_record() to do just 1 read per loop,
>     comparing with the previous value read
> 3) Change __get_engine_usage_record() to access the fields directly as it
>     was before the conversion to iosys-map
>
> (3) did gave a small improvement (~3%), but doesn't seem to scale well
> to other similar cases in the driver.
>
> Additional test by Chris Wilson using gem_create from igt with some
> changes to track object creation time. This happens to accidentally
> stress this code path:
>
> 	Pre iosys_map conversion of engine busyness:
> 	lmem0: Creating    262144 4KiB objects took 59274.2ms
>
> 	Unpatched:
> 	lmem0: Creating    262144 4KiB objects took 108830.2ms
>
> 	With readl (this patch):
> 	lmem0: Creating    262144 4KiB objects took 61348.6ms
>
> 	s/readl/READ_ONCE/
> 	lmem0: Creating    262144 4KiB objects took 61333.2ms
>
> So we do take a little bit more time than before the conversion, but
> that is due to other factors: bringing the READ_ONCE back would be as
> good as just doing this conversion.
>
> v2:
> - Remove default from _Generic() - callers wanting to read more
>    than u64 should use iosys_map_memcpy_from()
> - Add READ_ONCE() cases dereferencing the pointer when using system
>    memory
>
> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
> Reviewed-by: Christian König <christian.koenig@amd.com> # v1
> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>

Feel free to update my rb to v2 as well.

Apart from that do you have commit rights to drm-misc-next? If not 
should we push this?

Thanks,
Christian.

> ---
>   include/linux/iosys-map.h | 45 +++++++++++++++++++++++++++++++--------
>   1 file changed, 36 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/iosys-map.h b/include/linux/iosys-map.h
> index 4b8406ee8bc4..ec81ed995c59 100644
> --- a/include/linux/iosys-map.h
> +++ b/include/linux/iosys-map.h
> @@ -6,6 +6,7 @@
>   #ifndef __IOSYS_MAP_H__
>   #define __IOSYS_MAP_H__
>   
> +#include <linux/compiler_types.h>
>   #include <linux/io.h>
>   #include <linux/string.h>
>   
> @@ -333,6 +334,26 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
>   		memset(dst->vaddr + offset, value, len);
>   }
>   
> +#ifdef CONFIG_64BIT
> +#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)				\
> +	u64: val_ = readq(vaddr_iomem_)
> +#else
> +#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)				\
> +	u64: memcpy_fromio(&(val_), vaddr_iomem_, sizeof(u64))
> +#endif
> +
> +#define __iosys_map_rd_io(val__, vaddr_iomem__, type__) _Generic(val__,		\
> +	u8: val__ = readb(vaddr_iomem__),					\
> +	u16: val__ = readw(vaddr_iomem__),					\
> +	u32: val__ = readl(vaddr_iomem__),					\
> +	__iosys_map_rd_io_u64_case(val__, vaddr_iomem__))
> +
> +#define __iosys_map_rd_sys(val__, vaddr__, type__) ({				\
> +	compiletime_assert(sizeof(type__) <= sizeof(u64),			\
> +			   "Unsupported access size for __iosys_map_rd_sys()");	\
> +	val__ = READ_ONCE(*((type__ *)vaddr__));				\
> +})
> +
>   /**
>    * iosys_map_rd - Read a C-type value from the iosys_map
>    *
> @@ -340,16 +361,21 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
>    * @offset__:	The offset from which to read
>    * @type__:	Type of the value being read
>    *
> - * Read a C type value from iosys_map, handling possible un-aligned accesses to
> - * the mapping.
> + * Read a C type value (u8, u16, u32 and u64) from iosys_map. For other types or
> + * if pointer may be unaligned (and problematic for the architecture supported),
> + * use iosys_map_memcpy_from().
>    *
>    * Returns:
>    * The value read from the mapping.
>    */
> -#define iosys_map_rd(map__, offset__, type__) ({			\
> -	type__ val;							\
> -	iosys_map_memcpy_from(&val, map__, offset__, sizeof(val));	\
> -	val;								\
> +#define iosys_map_rd(map__, offset__, type__) ({				\
> +	type__ val;								\
> +	if ((map__)->is_iomem) {						\
> +		__iosys_map_rd_io(val, (map__)->vaddr_iomem + (offset__), type__);\
> +	} else {								\
> +		__iosys_map_rd_sys(val, (map__)->vaddr + (offset__), type__);	\
> +	}									\
> +	val;									\
>   })
>   
>   /**
> @@ -379,9 +405,10 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
>    *
>    * Read a value from iosys_map considering its layout is described by a C struct
>    * starting at @struct_offset__. The field offset and size is calculated and its
> - * value read handling possible un-aligned memory accesses. For example: suppose
> - * there is a @struct foo defined as below and the value ``foo.field2.inner2``
> - * needs to be read from the iosys_map:
> + * value read. If the field access would incur in un-aligned access, then either
> + * iosys_map_memcpy_from() needs to be used or the architecture must support it.
> + * For example: suppose there is a @struct foo defined as below and the value
> + * ``foo.field2.inner2`` needs to be read from the iosys_map:
>    *
>    * .. code-block:: c
>    *


  parent reply	other threads:[~2022-07-04 16:17 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-27 22:47 [CI 1/2] iosys-map: Add per-word read Lucas De Marchi
2022-06-27 22:47 ` [Intel-gfx] " Lucas De Marchi
2022-06-27 22:47 ` [CI 2/2] iosys-map: Add per-word write Lucas De Marchi
2022-06-27 22:47   ` [Intel-gfx] " Lucas De Marchi
2022-06-28  4:05 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [CI,1/2] iosys-map: Add per-word read Patchwork
2022-06-28  4:05 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2022-06-28  4:28 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2022-06-28  5:52 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [CI,1/2] iosys-map: Add per-word read (rev2) Patchwork
2022-06-28  5:52 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2022-06-28  6:14 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2022-07-04  6:30 ` Christian König [this message]
2022-07-04  6:30   ` [CI 1/2] iosys-map: Add per-word read Christian König
2022-07-04 17:14   ` Lucas De Marchi
2022-07-04 17:14     ` [Intel-gfx] " Lucas De Marchi
2022-06-28 19:10 Lucas De Marchi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=032a834c-3dbd-e997-9e4c-b3baccec43cb@amd.com \
    --to=christian.koenig@amd.com \
    --cc=daniel.vetter@intel.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=tzimmermann@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.