All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/7] assorted v5.x related fixups
@ 2021-03-27 10:19 Philippe Gerum
  2021-03-27 10:19 ` [PATCH v2 1/7] cobalt/memory: fix __vmalloc() calls Philippe Gerum
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw)
  To: xenomai

From: Philippe Gerum <rpm@xenomai.org>

More wrappers and other changes required to build for the kernel v5.x
series.

Philippe Gerum (7):
  cobalt/memory: fix __vmalloc() calls
  cobalt/debug: switch to mmap_lock interface
  cobalt/kernel: convert to proc_ops
  cobalt/debug: prefer dump_stack() to show_stack()
  drivers/net: wrap csum_partial_copy_nocheck()
  drivers/net: icmp: remove variable-length array
  drivers/net: cfg: fix config file load up

 kernel/cobalt/debug.c                         |  6 +-
 kernel/cobalt/heap.c                          |  3 +-
 .../include/asm-generic/xenomai/wrappers.h    | 59 +++++++++++++++
 kernel/cobalt/posix/memory.c                  |  7 +-
 kernel/cobalt/vfile.c                         | 26 +++----
 kernel/drivers/analogy/device.c               | 12 +--
 kernel/drivers/can/mscan/rtcan_mscan_proc.c   | 12 +--
 kernel/drivers/can/rtcan_module.c             | 66 ++++++++---------
 .../drivers/can/sja1000/rtcan_sja1000_proc.c  | 12 +--
 kernel/drivers/net/stack/ipv4/icmp.c          | 34 ++++++---
 kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c  | 74 +++++++++----------
 kernel/drivers/net/stack/rtskb.c              |  3 +-
 12 files changed, 189 insertions(+), 125 deletions(-)

-- 
2.29.2



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v2 1/7] cobalt/memory: fix __vmalloc() calls
  2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum
@ 2021-03-27 10:19 ` Philippe Gerum
  2021-03-27 10:19 ` [PATCH v2 2/7] cobalt/debug: switch to mmap_lock interface Philippe Gerum
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw)
  To: xenomai

From: Philippe Gerum <rpm@xenomai.org>

Since kernel v5.8, __vmalloc() does not take protection bits as
PROT_KERNEL is now wired in. Therefore we cannot disable the cache for
the UMM segment via the allocation call directly anymore.

This said, we don't support any CPU architecture exhibiting cache
aliasing braindamage anymore either (was armv4/v5), so let's convert
to the new __vmalloc() call format without bothering for cache
settings.

Signed-off-by: Philippe Gerum <rpm@xenomai.org>
---
 kernel/cobalt/heap.c                                 | 3 ++-
 kernel/cobalt/include/asm-generic/xenomai/wrappers.h | 6 ++++++
 kernel/cobalt/posix/memory.c                         | 7 ++++---
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/cobalt/heap.c b/kernel/cobalt/heap.c
index 9e184d4c3..d3f43729e 100644
--- a/kernel/cobalt/heap.c
+++ b/kernel/cobalt/heap.c
@@ -28,6 +28,7 @@
 #include <cobalt/kernel/heap.h>
 #include <cobalt/kernel/vfile.h>
 #include <cobalt/kernel/ancillaries.h>
+#include <asm/xenomai/wrappers.h>
 
 /**
  * @ingroup cobalt_core
@@ -849,7 +850,7 @@ void *xnheap_vmalloc(size_t size)
 	 * software on a 32bit system had to be wrong in the first
 	 * place anyway.
 	 */
-	return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
+	return vmalloc_kernel(size, 0);
 }
 EXPORT_SYMBOL_GPL(xnheap_vmalloc);
 
diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
index 9e8733b47..ac4e95aa0 100644
--- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
+++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
@@ -176,4 +176,10 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name,
 #define old_timeval32     compat_timeval
 #endif
 
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0)
+#define vmalloc_kernel(__size, __flags)	__vmalloc(__size, GFP_KERNEL|__flags, PAGE_KERNEL)
+#else
+#define vmalloc_kernel(__size, __flags)	__vmalloc(__size, GFP_KERNEL|__flags)
+#endif
+
 #endif /* _COBALT_ASM_GENERIC_WRAPPERS_H */
diff --git a/kernel/cobalt/posix/memory.c b/kernel/cobalt/posix/memory.c
index 1d888a219..fc88e264c 100644
--- a/kernel/cobalt/posix/memory.c
+++ b/kernel/cobalt/posix/memory.c
@@ -320,10 +320,11 @@ int cobalt_umm_init(struct cobalt_umm *umm, u32 size,
 
 	secondary_mode_only();
 
+	/* We don't support CPUs with VIVT caches and the like. */
+	BUG_ON(xnarch_cache_aliasing());
+
 	size = PAGE_ALIGN(size);
-	basemem = __vmalloc(size, GFP_KERNEL|__GFP_ZERO,
-			    xnarch_cache_aliasing() ?
-			    pgprot_noncached(PAGE_KERNEL) : PAGE_KERNEL);
+	basemem = vmalloc_kernel(size, __GFP_ZERO);
 	if (basemem == NULL)
 		return -ENOMEM;
 
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 2/7] cobalt/debug: switch to mmap_lock interface
  2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum
  2021-03-27 10:19 ` [PATCH v2 1/7] cobalt/memory: fix __vmalloc() calls Philippe Gerum
@ 2021-03-27 10:19 ` Philippe Gerum
  2021-03-27 10:19 ` [PATCH v2 3/7] cobalt/kernel: convert to proc_ops Philippe Gerum
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw)
  To: xenomai

From: Philippe Gerum <rpm@xenomai.org>

Signed-off-by: Philippe Gerum <rpm@xenomai.org>
---
 kernel/cobalt/debug.c                                | 4 ++--
 kernel/cobalt/include/asm-generic/xenomai/wrappers.h | 7 +++++++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/cobalt/debug.c b/kernel/cobalt/debug.c
index 1e9edda99..f97144aeb 100644
--- a/kernel/cobalt/debug.c
+++ b/kernel/cobalt/debug.c
@@ -239,7 +239,7 @@ void xndebug_trace_relax(int nr, unsigned long *backtrace,
 
 	memset(&spot, 0, sizeof(spot));
 	mm = get_task_mm(current);
-	down_read(&mm->mmap_sem);
+	mmap_read_lock(mm);
 
 	for (n = 0, depth = 0; n < nr; n++) {
 		pc = backtrace[n];
@@ -278,7 +278,7 @@ void xndebug_trace_relax(int nr, unsigned long *backtrace,
 		depth++;
 	}
 
-	up_read(&mm->mmap_sem);
+	mmap_read_unlock(mm);
 	mmput(mm);
 	free_page((unsigned long)tmp);
 
diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
index ac4e95aa0..cd22a8db5 100644
--- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
+++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
@@ -176,6 +176,13 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name,
 #define old_timeval32     compat_timeval
 #endif
 
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0)
+#define mmap_read_lock(__mm)	down_read(&mm->mmap_sem)
+#define mmap_read_unlock(__mm)	up_read(&mm->mmap_sem)
+#define mmap_write_lock(__mm)	down_write(&mm->mmap_sem)
+#define mmap_write_unlock(__mm)	up_write(&mm->mmap_sem)
+#endif
+
 #if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0)
 #define vmalloc_kernel(__size, __flags)	__vmalloc(__size, GFP_KERNEL|__flags, PAGE_KERNEL)
 #else
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 3/7] cobalt/kernel: convert to proc_ops
  2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum
  2021-03-27 10:19 ` [PATCH v2 1/7] cobalt/memory: fix __vmalloc() calls Philippe Gerum
  2021-03-27 10:19 ` [PATCH v2 2/7] cobalt/debug: switch to mmap_lock interface Philippe Gerum
@ 2021-03-27 10:19 ` Philippe Gerum
  2021-04-07  9:52   ` Jan Kiszka
  2021-03-27 10:19 ` [PATCH v2 4/7] cobalt/debug: prefer dump_stack() to show_stack() Philippe Gerum
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw)
  To: xenomai

From: Philippe Gerum <rpm@xenomai.org>

Signed-off-by: Philippe Gerum <rpm@xenomai.org>
---
 .../include/asm-generic/xenomai/wrappers.h    | 20 ++++++
 kernel/cobalt/vfile.c                         | 26 ++++----
 kernel/drivers/analogy/device.c               | 12 ++--
 kernel/drivers/can/mscan/rtcan_mscan_proc.c   | 12 ++--
 kernel/drivers/can/rtcan_module.c             | 66 +++++++++----------
 .../drivers/can/sja1000/rtcan_sja1000_proc.c  | 12 ++--
 6 files changed, 80 insertions(+), 68 deletions(-)

diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
index cd22a8db5..cc0cb0896 100644
--- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
+++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
@@ -170,6 +170,26 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name,
 #define __kernel_old_timeval	timeval
 #endif
 
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
+#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \
+	struct file_operations __name = {			    \
+		.open = (__open),				    \
+		.release = (__release),				    \
+		.read = (__read),				    \
+		.write = (__write),				    \
+		.llseek = seq_lseek,				    \
+}
+#else
+#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write)	\
+	struct proc_ops __name = {					\
+		.proc_open = (__open),					\
+		.proc_release = (__release),				\
+		.proc_read = (__read),					\
+		.proc_write = (__write),				\
+		.proc_lseek = seq_lseek,				\
+}
+#endif
+
 #if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0)
 #define old_timespec32    compat_timespec
 #define old_itimerspec32  compat_itimerspec
diff --git a/kernel/cobalt/vfile.c b/kernel/cobalt/vfile.c
index f65d46ddf..e9e10ce8d 100644
--- a/kernel/cobalt/vfile.c
+++ b/kernel/cobalt/vfile.c
@@ -340,13 +340,12 @@ ssize_t vfile_snapshot_write(struct file *file, const char __user *buf,
 	return ret;
 }
 
-static struct file_operations vfile_snapshot_fops = {
-	.open = vfile_snapshot_open,
-	.read = seq_read,
-	.write = vfile_snapshot_write,
-	.llseek = seq_lseek,
-	.release = vfile_snapshot_release,
-};
+static const DEFINE_PROC_OPS(vfile_snapshot_fops,
+		       vfile_snapshot_open,
+		       vfile_snapshot_release,
+		       seq_read,
+		       vfile_snapshot_write
+);
 
 /**
  * @fn int xnvfile_init_snapshot(const char *name, struct xnvfile_snapshot *vfile, struct xnvfile_directory *parent)
@@ -592,13 +591,12 @@ ssize_t vfile_regular_write(struct file *file, const char __user *buf,
 	return ret;
 }
 
-static struct file_operations vfile_regular_fops = {
-	.open = vfile_regular_open,
-	.read = seq_read,
-	.write = vfile_regular_write,
-	.llseek = seq_lseek,
-	.release = vfile_regular_release,
-};
+static const DEFINE_PROC_OPS(vfile_regular_fops,
+		       vfile_regular_open,
+		       vfile_regular_release,
+		       seq_read,
+		       vfile_regular_write
+);
 
 /**
  * @fn int xnvfile_init_regular(const char *name, struct xnvfile_regular *vfile, struct xnvfile_directory *parent)
diff --git a/kernel/drivers/analogy/device.c b/kernel/drivers/analogy/device.c
index 160dcf547..6ed588708 100644
--- a/kernel/drivers/analogy/device.c
+++ b/kernel/drivers/analogy/device.c
@@ -95,12 +95,12 @@ static int a4l_proc_transfer_open(struct inode *inode, struct file *file)
 	return single_open(file, a4l_rdproc_transfer, PDE_DATA(inode));
 }
 
-static const struct file_operations a4l_proc_transfer_ops = {
-	.open		= a4l_proc_transfer_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
+static const DEFINE_PROC_OPS(a4l_proc_transfer_ops,
+			     a4l_proc_transfer_open,
+			     single_release,
+			     seq_read,
+			     NULL
+);
 
 int a4l_proc_attach(struct a4l_device_context * cxt)
 {
diff --git a/kernel/drivers/can/mscan/rtcan_mscan_proc.c b/kernel/drivers/can/mscan/rtcan_mscan_proc.c
index 6b54ad4c7..732a02765 100644
--- a/kernel/drivers/can/mscan/rtcan_mscan_proc.c
+++ b/kernel/drivers/can/mscan/rtcan_mscan_proc.c
@@ -114,12 +114,12 @@ static int rtcan_mscan_proc_regs_open(struct inode *inode, struct file *file)
 	return single_open(file, rtcan_mscan_proc_regs, PDE_DATA(inode));
 }
 
-static const struct file_operations rtcan_mscan_proc_regs_ops = {
-	.open		= rtcan_mscan_proc_regs_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
+static const DEFINE_PROC_OPS(rtcan_mscan_proc_regs_ops,
+			rtcan_mscan_proc_regs_open,
+			single_elease,
+			seq_read,
+			NULL
+);
 
 int rtcan_mscan_create_proc(struct rtcan_device* dev)
 {
diff --git a/kernel/drivers/can/rtcan_module.c b/kernel/drivers/can/rtcan_module.c
index fbc5c35e4..7f3d4c395 100644
--- a/kernel/drivers/can/rtcan_module.c
+++ b/kernel/drivers/can/rtcan_module.c
@@ -157,12 +157,12 @@ static int rtcan_proc_devices_open(struct inode *inode, struct file *file)
 	return single_open(file, rtcan_read_proc_devices, NULL);
 }
 
-static const struct file_operations rtcan_proc_devices_ops = {
-	.open		= rtcan_proc_devices_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
+static const DEFINE_PROC_OPS(rtcan_proc_devices_ops,
+			rtcan_proc_devices_open,
+			single_release,
+			seq_read,
+			NULL
+);
 
 static int rtcan_read_proc_sockets(struct seq_file *p, void *data)
 {
@@ -220,13 +220,12 @@ static int rtcan_proc_sockets_open(struct inode *inode, struct file *file)
 	return single_open(file, rtcan_read_proc_sockets, NULL);
 }
 
-static const struct file_operations rtcan_proc_sockets_ops = {
-	.open		= rtcan_proc_sockets_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
-
+static const DEFINE_PROC_OPS(rtcan_proc_sockets_ops,
+			rtcan_proc_sockets_open,
+			single_release,
+			seq_read,
+			NULL
+);
 
 static int rtcan_read_proc_info(struct seq_file *p, void *data)
 {
@@ -271,14 +270,12 @@ static int rtcan_proc_info_open(struct inode *inode, struct file *file)
 	return single_open(file, rtcan_read_proc_info, PDE_DATA(inode));
 }
 
-static const struct file_operations rtcan_proc_info_ops = {
-	.open		= rtcan_proc_info_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
-
-
+static const DEFINE_PROC_OPS(rtcan_proc_info_ops,
+			rtcan_proc_info_open,
+			single_release,
+			seq_read,
+			NULL
+);
 
 static int rtcan_read_proc_filter(struct seq_file *p, void *data)
 {
@@ -319,14 +316,12 @@ static int rtcan_proc_filter_open(struct inode *inode, struct file *file)
 	return single_open(file, rtcan_read_proc_filter, PDE_DATA(inode));
 }
 
-static const struct file_operations rtcan_proc_filter_ops = {
-	.open		= rtcan_proc_filter_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
-
-
+static const DEFINE_PROC_OPS(rtcan_proc_filter_ops,
+			rtcan_proc_filter_open,
+			single_release,
+			seq_read,
+			NULL
+);
 
 static int rtcan_read_proc_version(struct seq_file *p, void *data)
 {
@@ -341,13 +336,12 @@ static int rtcan_proc_version_open(struct inode *inode, struct file *file)
 	return single_open(file, rtcan_read_proc_version, NULL);
 }
 
-static const struct file_operations rtcan_proc_version_ops = {
-	.open		= rtcan_proc_version_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
-
+static const DEFINE_PROC_OPS(rtcan_proc_version_ops,
+			rtcan_proc_version_open,
+			single_release,
+			seq_read,
+			NULL
+);
 
 void rtcan_dev_remove_proc(struct rtcan_device* dev)
 {
diff --git a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
index b4af8ab2e..0fdee8c37 100644
--- a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
+++ b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
@@ -50,12 +50,12 @@ static int rtcan_sja_proc_regs_open(struct inode *inode, struct file *file)
 	return single_open(file, rtcan_sja_proc_regs, PDE_DATA(inode));
 }
 
-static const struct file_operations rtcan_sja_proc_regs_ops = {
-	.open		= rtcan_sja_proc_regs_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
+static const DEFINE_PROC_OPS(rtcan_sja_proc_regs_ops,
+			rtcan_sja_proc_regs_open,
+			single_release,
+			seq_read,
+			NULL
+);
 
 int rtcan_sja_create_proc(struct rtcan_device* dev)
 {
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 4/7] cobalt/debug: prefer dump_stack() to show_stack()
  2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum
                   ` (2 preceding siblings ...)
  2021-03-27 10:19 ` [PATCH v2 3/7] cobalt/kernel: convert to proc_ops Philippe Gerum
@ 2021-03-27 10:19 ` Philippe Gerum
  2021-03-27 10:19 ` [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() Philippe Gerum
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw)
  To: xenomai

From: Philippe Gerum <rpm@xenomai.org>

dump_stack() also prints out the system information, which is always
good to have.

Signed-off-by: Philippe Gerum <rpm@xenomai.org>
---
 kernel/cobalt/debug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cobalt/debug.c b/kernel/cobalt/debug.c
index f97144aeb..a6e2cc42d 100644
--- a/kernel/cobalt/debug.c
+++ b/kernel/cobalt/debug.c
@@ -592,7 +592,7 @@ int xnlock_dbg_release(struct xnlock *lock,
 				"          last owner = %s:%u (%s(), CPU #%d)\n",
 		       lock, cpu, lock->file, lock->line, lock->function,
 		       lock->cpu);
-		show_stack(NULL,NULL);
+		dump_stack();
 		return 1;
 	}
 
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck()
  2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum
                   ` (3 preceding siblings ...)
  2021-03-27 10:19 ` [PATCH v2 4/7] cobalt/debug: prefer dump_stack() to show_stack() Philippe Gerum
@ 2021-03-27 10:19 ` Philippe Gerum
  2021-04-07 10:06   ` Jan Kiszka
  2021-03-27 10:19 ` [PATCH v2 6/7] drivers/net: icmp: remove variable-length array Philippe Gerum
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw)
  To: xenomai

From: Philippe Gerum <rpm@xenomai.org>

Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
last argument to csum_partial(). According to #cc44c17baf7f3, passing
a non-zero value would not even yield the proper result on some
architectures.

Nevertheless, the current ICMP code does expect a non-zero csum seed
to be used in the next computation, so let's wrap net_csum_copy() to
csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for
later kernels so that we still feed csum_partial() with the user-given
csum. We still expect the x86, ARM and arm64 implementations of
csum_partial() to do the right thing.

Signed-off-by: Philippe Gerum <rpm@xenomai.org>
---
 .../include/asm-generic/xenomai/wrappers.h     | 11 +++++++++++
 kernel/drivers/net/stack/ipv4/icmp.c           | 18 +++++++++---------
 kernel/drivers/net/stack/rtskb.c               |  3 +--
 3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
index cc0cb0896..cfd28fc47 100644
--- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
+++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
@@ -209,4 +209,15 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name,
 #define vmalloc_kernel(__size, __flags)	__vmalloc(__size, GFP_KERNEL|__flags)
 #endif
 
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0)
+#define net_csum_copy(__src, __dst, __len, __csum)	\
+	csum_partial_copy_nocheck(__src, __dst, __len, __csum)
+#else
+#define net_csum_copy(__src, __dst, __len, __csum)			\
+	({								\
+		memcpy(__dst, __src, __len);				\
+		csum_partial(__dst, __len, (__force __wsum)__csum);	\
+	})
+#endif
+
 #endif /* _COBALT_ASM_GENERIC_WRAPPERS_H */
diff --git a/kernel/drivers/net/stack/ipv4/icmp.c b/kernel/drivers/net/stack/ipv4/icmp.c
index b723040aa..0671cbc93 100644
--- a/kernel/drivers/net/stack/ipv4/icmp.c
+++ b/kernel/drivers/net/stack/ipv4/icmp.c
@@ -142,9 +142,9 @@ static int rt_icmp_glue_reply_bits(const void *p, unsigned char *to,
 	if (offset != 0)
 		return -EMSGSIZE;
 
-	csum = csum_partial_copy_nocheck((void *)&icmp_param->head, to,
-					 icmp_param->head_len,
-					 icmp_param->csum);
+	csum = net_csum_copy((void *)&icmp_param->head, to,
+			     icmp_param->head_len,
+			     icmp_param->csum);
 
 	csum = rtskb_copy_and_csum_bits(icmp_param->data.skb,
 					icmp_param->offset,
@@ -259,13 +259,13 @@ static int rt_icmp_glue_request_bits(const void *p, unsigned char *to,
 			    __FUNCTION__);
 		return -1;);
 
-	csum = csum_partial_copy_nocheck((void *)&icmp_param->head, to,
-					 icmp_param->head_len,
-					 icmp_param->csum);
+	csum = net_csum_copy((void *)&icmp_param->head, to,
+			     icmp_param->head_len,
+			     icmp_param->csum);
 
-	csum = csum_partial_copy_nocheck(icmp_param->data.buf,
-					 to + icmp_param->head_len,
-					 fraglen - icmp_param->head_len, csum);
+	csum = net_csum_copy(icmp_param->data.buf,
+			     to + icmp_param->head_len,
+			     fraglen - icmp_param->head_len, csum);
 
 	icmph = (struct icmphdr *)to;
 
diff --git a/kernel/drivers/net/stack/rtskb.c b/kernel/drivers/net/stack/rtskb.c
index 84d021d24..0d0c66e58 100644
--- a/kernel/drivers/net/stack/rtskb.c
+++ b/kernel/drivers/net/stack/rtskb.c
@@ -69,8 +69,7 @@ unsigned int rtskb_copy_and_csum_bits(const struct rtskb *skb, int offset,
 	if ((copy = skb->len - offset) > 0) {
 		if (copy > len)
 			copy = len;
-		csum = csum_partial_copy_nocheck(skb->data + offset, to, copy,
-						 csum);
+		csum = net_csum_copy(skb->data + offset, to, copy, csum);
 		if ((len -= copy) == 0)
 			return csum;
 		offset += copy;
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 6/7] drivers/net: icmp: remove variable-length array
  2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum
                   ` (4 preceding siblings ...)
  2021-03-27 10:19 ` [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() Philippe Gerum
@ 2021-03-27 10:19 ` Philippe Gerum
  2021-04-07 10:24   ` Jan Kiszka
  2021-03-27 10:19 ` [PATCH v2 7/7] drivers/net: cfg: fix config file load up Philippe Gerum
  2021-04-07 10:39 ` [PATCH v2 0/7] assorted v5.x related fixups Jan Kiszka
  7 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw)
  To: xenomai

From: Philippe Gerum <rpm@xenomai.org>

Signed-off-by: Philippe Gerum <rpm@xenomai.org>
---
 kernel/drivers/net/stack/ipv4/icmp.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/drivers/net/stack/ipv4/icmp.c b/kernel/drivers/net/stack/ipv4/icmp.c
index 0671cbc93..c3563595e 100644
--- a/kernel/drivers/net/stack/ipv4/icmp.c
+++ b/kernel/drivers/net/stack/ipv4/icmp.c
@@ -313,8 +313,17 @@ static int rt_icmp_send_request(u32 daddr, struct icmp_bxm *icmp_param)
 int rt_icmp_send_echo(u32 daddr, u16 id, u16 sequence, size_t msg_size)
 {
 	struct icmp_bxm icmp_param;
-	unsigned char pattern_buf[msg_size];
+	unsigned char *pattern_buf;
 	off_t pos;
+	int ret;
+
+	/*
+	 * This is ping, exec time is not that critical, so
+	 * rtdm_malloc() is ok for the purpose of echoing.
+	 */
+	pattern_buf = rtdm_malloc(msg_size);
+	if (pattern_buf == NULL)
+		return -ENOMEM;
 
 	/* first purge any potentially pending ICMP fragments */
 	rt_ip_frag_invalidate_socket(icmp_socket);
@@ -343,7 +352,10 @@ int rt_icmp_send_echo(u32 daddr, u16 id, u16 sequence, size_t msg_size)
 	}
 	icmp_param.data.buf = pattern_buf;
 
-	return rt_icmp_send_request(daddr, &icmp_param);
+	ret = rt_icmp_send_request(daddr, &icmp_param);
+	rtdm_free(pattern_buf);
+
+	return ret;
 }
 
 /***
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 7/7] drivers/net: cfg: fix config file load up
  2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum
                   ` (5 preceding siblings ...)
  2021-03-27 10:19 ` [PATCH v2 6/7] drivers/net: icmp: remove variable-length array Philippe Gerum
@ 2021-03-27 10:19 ` Philippe Gerum
  2021-04-07 10:35   ` Jan Kiszka
  2021-04-07 10:39 ` [PATCH v2 0/7] assorted v5.x related fixups Jan Kiszka
  7 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw)
  To: xenomai

From: Philippe Gerum <rpm@xenomai.org>

set_fs() is on its way out, so we cannot open code a file read
operation by calling the VFS handler directly anymore, faking a user
address space.

We do have kernel interfaces for loading files though, particularly
kernel_read_file(). So let's use that one for loading the
configuration file contents. Unfortunately, the signature of this
service changed during the 5.9-rc cycle, so we have to resort to an
ugly wrapper to cope with all supported kernels once again. Sigh.

Signed-off-by: Philippe Gerum <rpm@xenomai.org>
---
 .../include/asm-generic/xenomai/wrappers.h    | 15 ++++
 kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c  | 74 +++++++++----------
 2 files changed, 52 insertions(+), 37 deletions(-)

diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
index cfd28fc47..f15fe4389 100644
--- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
+++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
@@ -220,4 +220,19 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name,
 	})
 #endif
 
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0)
+#define read_file_from_kernel(__file, __buf, __buf_size, __file_size, __id) \
+	({								\
+		loff_t ___file_size;					\
+		int __ret;						\
+		__ret = kernel_read_file(__file, __buf, &___file_size,	\
+				__buf_size, __id);			\
+		(*__file_size) = ___file_size;				\
+		__ret;							\
+	})
+#else
+#define read_file_from_kernel(__file, __buf, __buf_size, __file_size, __id) \
+	kernel_read_file(__file, 0, __buf, __buf_size, __file_size, __id)
+#endif
+
 #endif /* _COBALT_ASM_GENERIC_WRAPPERS_H */
diff --git a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c
index 769b4e143..e460571c2 100644
--- a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c
+++ b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c
@@ -22,6 +22,7 @@
  *
  */
 
+#include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/vmalloc.h>
 
@@ -196,6 +197,40 @@ void cleanup_cmd_detach(void *priv_data)
 		kfree_rtskb(cmd->args.detach.stage2_chain);
 }
 
+static int load_cfg_file(struct rtcfg_file *cfgfile, struct rtcfg_cmd *cmd)
+{
+	size_t file_size = 0;
+	struct file *filp;
+	loff_t i_size;
+	int ret;
+
+	filp = filp_open(cfgfile->name, O_RDONLY, 0);
+	if (IS_ERR(filp))
+		return PTR_ERR(filp);
+
+	i_size = i_size_read(file_inode(filp));
+	if (i_size <= 0) {
+		/* allocate buffer even for empty files */
+		cfgfile->buffer = vmalloc(1);
+	} else {
+		cfgfile->buffer = NULL;
+		/* Assume 1Mb should be enough for a config file... */
+		ret = read_file_from_kernel(filp, &cfgfile->buffer,
+				1UL << 30, &file_size, READING_UNKNOWN);
+		if (ret < 0) {
+			fput(filp);
+			return ret;
+		}
+	}
+
+	fput(filp);
+	cfgfile->size = file_size;
+
+	/* dispatch again, this time with new file attached */
+	return rtpc_dispatch_call(rtcfg_event_handler, 0, cmd,
+				sizeof(*cmd), NULL, cleanup_cmd_add);
+}
+
 int rtcfg_ioctl_add(struct rtnet_device *rtdev, struct rtcfg_cmd *cmd)
 {
 	struct rtcfg_connection *conn_buf;
@@ -264,46 +299,11 @@ int rtcfg_ioctl_add(struct rtnet_device *rtdev, struct rtcfg_cmd *cmd)
 
 	/* load file if missing */
 	if (ret > 0) {
-		struct file *filp;
-		mm_segment_t oldfs;
-
-		filp = filp_open(file->name, O_RDONLY, 0);
-		if (IS_ERR(filp)) {
+		ret = load_cfg_file(file, cmd);
+		if (ret) {
 			rtcfg_unlockwr_proc(cmd->internal.data.ifindex);
-			ret = PTR_ERR(filp);
 			goto err;
 		}
-
-		file->size = filp->f_path.dentry->d_inode->i_size;
-
-		/* allocate buffer even for empty files */
-		file->buffer = vmalloc((file->size) ? file->size : 1);
-		if (file->buffer == NULL) {
-			rtcfg_unlockwr_proc(cmd->internal.data.ifindex);
-			fput(filp);
-			ret = -ENOMEM;
-			goto err;
-		}
-
-		oldfs = get_fs();
-		set_fs(KERNEL_DS);
-		filp->f_pos = 0;
-
-		ret = filp->f_op->read(filp, file->buffer, file->size,
-				       &filp->f_pos);
-
-		set_fs(oldfs);
-		fput(filp);
-
-		if (ret != (int)file->size) {
-			rtcfg_unlockwr_proc(cmd->internal.data.ifindex);
-			ret = -EIO;
-			goto err;
-		}
-
-		/* dispatch again, this time with new file attached */
-		ret = rtpc_dispatch_call(rtcfg_event_handler, 0, cmd,
-					 sizeof(*cmd), NULL, cleanup_cmd_add);
 	}
 
 	return ret;
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 3/7] cobalt/kernel: convert to proc_ops
  2021-03-27 10:19 ` [PATCH v2 3/7] cobalt/kernel: convert to proc_ops Philippe Gerum
@ 2021-04-07  9:52   ` Jan Kiszka
  2021-04-07 10:17     ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kiszka @ 2021-04-07  9:52 UTC (permalink / raw)
  To: Philippe Gerum, xenomai

On 27.03.21 11:19, Philippe Gerum wrote:
> From: Philippe Gerum <rpm@xenomai.org>
> 
> Signed-off-by: Philippe Gerum <rpm@xenomai.org>
> ---
>  .../include/asm-generic/xenomai/wrappers.h    | 20 ++++++
>  kernel/cobalt/vfile.c                         | 26 ++++----
>  kernel/drivers/analogy/device.c               | 12 ++--
>  kernel/drivers/can/mscan/rtcan_mscan_proc.c   | 12 ++--
>  kernel/drivers/can/rtcan_module.c             | 66 +++++++++----------
>  .../drivers/can/sja1000/rtcan_sja1000_proc.c  | 12 ++--
>  6 files changed, 80 insertions(+), 68 deletions(-)
> 
> diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
> index cd22a8db5..cc0cb0896 100644
> --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
> +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
> @@ -170,6 +170,26 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name,
>  #define __kernel_old_timeval	timeval
>  #endif
>  
> +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
> +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \
> +	struct file_operations __name = {			    \
> +		.open = (__open),				    \
> +		.release = (__release),				    \
> +		.read = (__read),				    \
> +		.write = (__write),				    \
> +		.llseek = seq_lseek,				    \
> +}
> +#else
> +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write)	\
> +	struct proc_ops __name = {					\
> +		.proc_open = (__open),					\
> +		.proc_release = (__release),				\
> +		.proc_read = (__read),					\
> +		.proc_write = (__write),				\
> +		.proc_lseek = seq_lseek,				\
> +}
> +#endif
> +
>  #if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0)
>  #define old_timespec32    compat_timespec
>  #define old_itimerspec32  compat_itimerspec
> diff --git a/kernel/cobalt/vfile.c b/kernel/cobalt/vfile.c
> index f65d46ddf..e9e10ce8d 100644
> --- a/kernel/cobalt/vfile.c
> +++ b/kernel/cobalt/vfile.c
> @@ -340,13 +340,12 @@ ssize_t vfile_snapshot_write(struct file *file, const char __user *buf,
>  	return ret;
>  }
>  
> -static struct file_operations vfile_snapshot_fops = {
> -	.open = vfile_snapshot_open,
> -	.read = seq_read,
> -	.write = vfile_snapshot_write,
> -	.llseek = seq_lseek,
> -	.release = vfile_snapshot_release,
> -};
> +static const DEFINE_PROC_OPS(vfile_snapshot_fops,
> +		       vfile_snapshot_open,
> +		       vfile_snapshot_release,
> +		       seq_read,
> +		       vfile_snapshot_write
> +);
>  
>  /**
>   * @fn int xnvfile_init_snapshot(const char *name, struct xnvfile_snapshot *vfile, struct xnvfile_directory *parent)
> @@ -592,13 +591,12 @@ ssize_t vfile_regular_write(struct file *file, const char __user *buf,
>  	return ret;
>  }
>  
> -static struct file_operations vfile_regular_fops = {
> -	.open = vfile_regular_open,
> -	.read = seq_read,
> -	.write = vfile_regular_write,
> -	.llseek = seq_lseek,
> -	.release = vfile_regular_release,
> -};
> +static const DEFINE_PROC_OPS(vfile_regular_fops,
> +		       vfile_regular_open,
> +		       vfile_regular_release,
> +		       seq_read,
> +		       vfile_regular_write
> +);
>  
>  /**
>   * @fn int xnvfile_init_regular(const char *name, struct xnvfile_regular *vfile, struct xnvfile_directory *parent)
> diff --git a/kernel/drivers/analogy/device.c b/kernel/drivers/analogy/device.c
> index 160dcf547..6ed588708 100644
> --- a/kernel/drivers/analogy/device.c
> +++ b/kernel/drivers/analogy/device.c
> @@ -95,12 +95,12 @@ static int a4l_proc_transfer_open(struct inode *inode, struct file *file)
>  	return single_open(file, a4l_rdproc_transfer, PDE_DATA(inode));
>  }
>  
> -static const struct file_operations a4l_proc_transfer_ops = {
> -	.open		= a4l_proc_transfer_open,
> -	.read		= seq_read,
> -	.llseek		= seq_lseek,
> -	.release	= single_release,
> -};
> +static const DEFINE_PROC_OPS(a4l_proc_transfer_ops,
> +			     a4l_proc_transfer_open,
> +			     single_release,
> +			     seq_read,
> +			     NULL
> +);
>  
>  int a4l_proc_attach(struct a4l_device_context * cxt)
>  {
> diff --git a/kernel/drivers/can/mscan/rtcan_mscan_proc.c b/kernel/drivers/can/mscan/rtcan_mscan_proc.c
> index 6b54ad4c7..732a02765 100644
> --- a/kernel/drivers/can/mscan/rtcan_mscan_proc.c
> +++ b/kernel/drivers/can/mscan/rtcan_mscan_proc.c
> @@ -114,12 +114,12 @@ static int rtcan_mscan_proc_regs_open(struct inode *inode, struct file *file)
>  	return single_open(file, rtcan_mscan_proc_regs, PDE_DATA(inode));
>  }
>  
> -static const struct file_operations rtcan_mscan_proc_regs_ops = {
> -	.open		= rtcan_mscan_proc_regs_open,
> -	.read		= seq_read,
> -	.llseek		= seq_lseek,
> -	.release	= single_release,
> -};
> +static const DEFINE_PROC_OPS(rtcan_mscan_proc_regs_ops,
> +			rtcan_mscan_proc_regs_open,
> +			single_elease,
> +			seq_read,
> +			NULL
> +);
>  
>  int rtcan_mscan_create_proc(struct rtcan_device* dev)
>  {
> diff --git a/kernel/drivers/can/rtcan_module.c b/kernel/drivers/can/rtcan_module.c
> index fbc5c35e4..7f3d4c395 100644
> --- a/kernel/drivers/can/rtcan_module.c
> +++ b/kernel/drivers/can/rtcan_module.c
> @@ -157,12 +157,12 @@ static int rtcan_proc_devices_open(struct inode *inode, struct file *file)
>  	return single_open(file, rtcan_read_proc_devices, NULL);
>  }
>  
> -static const struct file_operations rtcan_proc_devices_ops = {
> -	.open		= rtcan_proc_devices_open,
> -	.read		= seq_read,
> -	.llseek		= seq_lseek,
> -	.release	= single_release,
> -};
> +static const DEFINE_PROC_OPS(rtcan_proc_devices_ops,
> +			rtcan_proc_devices_open,
> +			single_release,
> +			seq_read,
> +			NULL
> +);
>  
>  static int rtcan_read_proc_sockets(struct seq_file *p, void *data)
>  {
> @@ -220,13 +220,12 @@ static int rtcan_proc_sockets_open(struct inode *inode, struct file *file)
>  	return single_open(file, rtcan_read_proc_sockets, NULL);
>  }
>  
> -static const struct file_operations rtcan_proc_sockets_ops = {
> -	.open		= rtcan_proc_sockets_open,
> -	.read		= seq_read,
> -	.llseek		= seq_lseek,
> -	.release	= single_release,
> -};
> -
> +static const DEFINE_PROC_OPS(rtcan_proc_sockets_ops,
> +			rtcan_proc_sockets_open,
> +			single_release,
> +			seq_read,
> +			NULL
> +);
>  
>  static int rtcan_read_proc_info(struct seq_file *p, void *data)
>  {
> @@ -271,14 +270,12 @@ static int rtcan_proc_info_open(struct inode *inode, struct file *file)
>  	return single_open(file, rtcan_read_proc_info, PDE_DATA(inode));
>  }
>  
> -static const struct file_operations rtcan_proc_info_ops = {
> -	.open		= rtcan_proc_info_open,
> -	.read		= seq_read,
> -	.llseek		= seq_lseek,
> -	.release	= single_release,
> -};
> -
> -
> +static const DEFINE_PROC_OPS(rtcan_proc_info_ops,
> +			rtcan_proc_info_open,
> +			single_release,
> +			seq_read,
> +			NULL
> +);
>  
>  static int rtcan_read_proc_filter(struct seq_file *p, void *data)
>  {
> @@ -319,14 +316,12 @@ static int rtcan_proc_filter_open(struct inode *inode, struct file *file)
>  	return single_open(file, rtcan_read_proc_filter, PDE_DATA(inode));
>  }
>  
> -static const struct file_operations rtcan_proc_filter_ops = {
> -	.open		= rtcan_proc_filter_open,
> -	.read		= seq_read,
> -	.llseek		= seq_lseek,
> -	.release	= single_release,
> -};
> -
> -
> +static const DEFINE_PROC_OPS(rtcan_proc_filter_ops,
> +			rtcan_proc_filter_open,
> +			single_release,
> +			seq_read,
> +			NULL
> +);
>  
>  static int rtcan_read_proc_version(struct seq_file *p, void *data)
>  {
> @@ -341,13 +336,12 @@ static int rtcan_proc_version_open(struct inode *inode, struct file *file)
>  	return single_open(file, rtcan_read_proc_version, NULL);
>  }
>  
> -static const struct file_operations rtcan_proc_version_ops = {
> -	.open		= rtcan_proc_version_open,
> -	.read		= seq_read,
> -	.llseek		= seq_lseek,
> -	.release	= single_release,
> -};
> -
> +static const DEFINE_PROC_OPS(rtcan_proc_version_ops,
> +			rtcan_proc_version_open,
> +			single_release,
> +			seq_read,
> +			NULL
> +);
>  
>  void rtcan_dev_remove_proc(struct rtcan_device* dev)
>  {
> diff --git a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
> index b4af8ab2e..0fdee8c37 100644
> --- a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
> +++ b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
> @@ -50,12 +50,12 @@ static int rtcan_sja_proc_regs_open(struct inode *inode, struct file *file)
>  	return single_open(file, rtcan_sja_proc_regs, PDE_DATA(inode));
>  }
>  
> -static const struct file_operations rtcan_sja_proc_regs_ops = {
> -	.open		= rtcan_sja_proc_regs_open,
> -	.read		= seq_read,
> -	.llseek		= seq_lseek,
> -	.release	= single_release,
> -};
> +static const DEFINE_PROC_OPS(rtcan_sja_proc_regs_ops,
> +			rtcan_sja_proc_regs_open,
> +			single_release,
> +			seq_read,
> +			NULL
> +);
>  
>  int rtcan_sja_create_proc(struct rtcan_device* dev)
>  {
> 

This leaves out a4l_proc_devs_ops and a4l_proc_drvs_ops, while
converting a4l_proc_transfer_ops. By accident? In any case, I've added
those two cases to the patch.

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck()
  2021-03-27 10:19 ` [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() Philippe Gerum
@ 2021-04-07 10:06   ` Jan Kiszka
  2021-04-15  7:21     ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kiszka @ 2021-04-07 10:06 UTC (permalink / raw)
  To: Philippe Gerum, xenomai

On 27.03.21 11:19, Philippe Gerum wrote:
> From: Philippe Gerum <rpm@xenomai.org>
> 
> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
> last argument to csum_partial(). According to #cc44c17baf7f3, passing
> a non-zero value would not even yield the proper result on some
> architectures.
> 
> Nevertheless, the current ICMP code does expect a non-zero csum seed
> to be used in the next computation, so let's wrap net_csum_copy() to
> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for
> later kernels so that we still feed csum_partial() with the user-given
> csum. We still expect the x86, ARM and arm64 implementations of
> csum_partial() to do the right thing.
> 

If that issue only affects the ICMP code path, why not only changing
that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
and csum in one step?

Jan

> Signed-off-by: Philippe Gerum <rpm@xenomai.org>
> ---
>  .../include/asm-generic/xenomai/wrappers.h     | 11 +++++++++++
>  kernel/drivers/net/stack/ipv4/icmp.c           | 18 +++++++++---------
>  kernel/drivers/net/stack/rtskb.c               |  3 +--
>  3 files changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
> index cc0cb0896..cfd28fc47 100644
> --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
> +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
> @@ -209,4 +209,15 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name,
>  #define vmalloc_kernel(__size, __flags)	__vmalloc(__size, GFP_KERNEL|__flags)
>  #endif
>  
> +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0)
> +#define net_csum_copy(__src, __dst, __len, __csum)	\
> +	csum_partial_copy_nocheck(__src, __dst, __len, __csum)
> +#else
> +#define net_csum_copy(__src, __dst, __len, __csum)			\
> +	({								\
> +		memcpy(__dst, __src, __len);				\
> +		csum_partial(__dst, __len, (__force __wsum)__csum);	\
> +	})
> +#endif
> +
>  #endif /* _COBALT_ASM_GENERIC_WRAPPERS_H */
> diff --git a/kernel/drivers/net/stack/ipv4/icmp.c b/kernel/drivers/net/stack/ipv4/icmp.c
> index b723040aa..0671cbc93 100644
> --- a/kernel/drivers/net/stack/ipv4/icmp.c
> +++ b/kernel/drivers/net/stack/ipv4/icmp.c
> @@ -142,9 +142,9 @@ static int rt_icmp_glue_reply_bits(const void *p, unsigned char *to,
>  	if (offset != 0)
>  		return -EMSGSIZE;
>  
> -	csum = csum_partial_copy_nocheck((void *)&icmp_param->head, to,
> -					 icmp_param->head_len,
> -					 icmp_param->csum);
> +	csum = net_csum_copy((void *)&icmp_param->head, to,
> +			     icmp_param->head_len,
> +			     icmp_param->csum);
>  
>  	csum = rtskb_copy_and_csum_bits(icmp_param->data.skb,
>  					icmp_param->offset,
> @@ -259,13 +259,13 @@ static int rt_icmp_glue_request_bits(const void *p, unsigned char *to,
>  			    __FUNCTION__);
>  		return -1;);
>  
> -	csum = csum_partial_copy_nocheck((void *)&icmp_param->head, to,
> -					 icmp_param->head_len,
> -					 icmp_param->csum);
> +	csum = net_csum_copy((void *)&icmp_param->head, to,
> +			     icmp_param->head_len,
> +			     icmp_param->csum);
>  
> -	csum = csum_partial_copy_nocheck(icmp_param->data.buf,
> -					 to + icmp_param->head_len,
> -					 fraglen - icmp_param->head_len, csum);
> +	csum = net_csum_copy(icmp_param->data.buf,
> +			     to + icmp_param->head_len,
> +			     fraglen - icmp_param->head_len, csum);
>  
>  	icmph = (struct icmphdr *)to;
>  
> diff --git a/kernel/drivers/net/stack/rtskb.c b/kernel/drivers/net/stack/rtskb.c
> index 84d021d24..0d0c66e58 100644
> --- a/kernel/drivers/net/stack/rtskb.c
> +++ b/kernel/drivers/net/stack/rtskb.c
> @@ -69,8 +69,7 @@ unsigned int rtskb_copy_and_csum_bits(const struct rtskb *skb, int offset,
>  	if ((copy = skb->len - offset) > 0) {
>  		if (copy > len)
>  			copy = len;
> -		csum = csum_partial_copy_nocheck(skb->data + offset, to, copy,
> -						 csum);
> +		csum = net_csum_copy(skb->data + offset, to, copy, csum);
>  		if ((len -= copy) == 0)
>  			return csum;
>  		offset += copy;
> 

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 3/7] cobalt/kernel: convert to proc_ops
  2021-04-07  9:52   ` Jan Kiszka
@ 2021-04-07 10:17     ` Philippe Gerum
  2021-04-07 10:37       ` Jan Kiszka
  0 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2021-04-07 10:17 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai


Jan Kiszka <jan.kiszka@siemens.com> writes:

> On 27.03.21 11:19, Philippe Gerum wrote:
>> From: Philippe Gerum <rpm@xenomai.org>
>> 
>> Signed-off-by: Philippe Gerum <rpm@xenomai.org>
>> ---
>>  .../include/asm-generic/xenomai/wrappers.h    | 20 ++++++
>>  kernel/cobalt/vfile.c                         | 26 ++++----
>>  kernel/drivers/analogy/device.c               | 12 ++--
>>  kernel/drivers/can/mscan/rtcan_mscan_proc.c   | 12 ++--
>>  kernel/drivers/can/rtcan_module.c             | 66 +++++++++----------
>>  .../drivers/can/sja1000/rtcan_sja1000_proc.c  | 12 ++--
>>  6 files changed, 80 insertions(+), 68 deletions(-)
>> 
>> diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
>> index cd22a8db5..cc0cb0896 100644
>> --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
>> +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
>> @@ -170,6 +170,26 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name,
>>  #define __kernel_old_timeval	timeval
>>  #endif
>>  
>> +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
>> +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \
>> +	struct file_operations __name = {			    \
>> +		.open = (__open),				    \
>> +		.release = (__release),				    \
>> +		.read = (__read),				    \
>> +		.write = (__write),				    \
>> +		.llseek = seq_lseek,				    \
>> +}
>> +#else
>> +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write)	\
>> +	struct proc_ops __name = {					\
>> +		.proc_open = (__open),					\
>> +		.proc_release = (__release),				\
>> +		.proc_read = (__read),					\
>> +		.proc_write = (__write),				\
>> +		.proc_lseek = seq_lseek,				\
>> +}
>> +#endif
>> +
>>  #if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0)
>>  #define old_timespec32    compat_timespec
>>  #define old_itimerspec32  compat_itimerspec
>> diff --git a/kernel/cobalt/vfile.c b/kernel/cobalt/vfile.c
>> index f65d46ddf..e9e10ce8d 100644
>> --- a/kernel/cobalt/vfile.c
>> +++ b/kernel/cobalt/vfile.c
>> @@ -340,13 +340,12 @@ ssize_t vfile_snapshot_write(struct file *file, const char __user *buf,
>>  	return ret;
>>  }
>>  
>> -static struct file_operations vfile_snapshot_fops = {
>> -	.open = vfile_snapshot_open,
>> -	.read = seq_read,
>> -	.write = vfile_snapshot_write,
>> -	.llseek = seq_lseek,
>> -	.release = vfile_snapshot_release,
>> -};
>> +static const DEFINE_PROC_OPS(vfile_snapshot_fops,
>> +		       vfile_snapshot_open,
>> +		       vfile_snapshot_release,
>> +		       seq_read,
>> +		       vfile_snapshot_write
>> +);
>>  
>>  /**
>>   * @fn int xnvfile_init_snapshot(const char *name, struct xnvfile_snapshot *vfile, struct xnvfile_directory *parent)
>> @@ -592,13 +591,12 @@ ssize_t vfile_regular_write(struct file *file, const char __user *buf,
>>  	return ret;
>>  }
>>  
>> -static struct file_operations vfile_regular_fops = {
>> -	.open = vfile_regular_open,
>> -	.read = seq_read,
>> -	.write = vfile_regular_write,
>> -	.llseek = seq_lseek,
>> -	.release = vfile_regular_release,
>> -};
>> +static const DEFINE_PROC_OPS(vfile_regular_fops,
>> +		       vfile_regular_open,
>> +		       vfile_regular_release,
>> +		       seq_read,
>> +		       vfile_regular_write
>> +);
>>  
>>  /**
>>   * @fn int xnvfile_init_regular(const char *name, struct xnvfile_regular *vfile, struct xnvfile_directory *parent)
>> diff --git a/kernel/drivers/analogy/device.c b/kernel/drivers/analogy/device.c
>> index 160dcf547..6ed588708 100644
>> --- a/kernel/drivers/analogy/device.c
>> +++ b/kernel/drivers/analogy/device.c
>> @@ -95,12 +95,12 @@ static int a4l_proc_transfer_open(struct inode *inode, struct file *file)
>>  	return single_open(file, a4l_rdproc_transfer, PDE_DATA(inode));
>>  }
>>  
>> -static const struct file_operations a4l_proc_transfer_ops = {
>> -	.open		= a4l_proc_transfer_open,
>> -	.read		= seq_read,
>> -	.llseek		= seq_lseek,
>> -	.release	= single_release,
>> -};
>> +static const DEFINE_PROC_OPS(a4l_proc_transfer_ops,
>> +			     a4l_proc_transfer_open,
>> +			     single_release,
>> +			     seq_read,
>> +			     NULL
>> +);
>>  
>>  int a4l_proc_attach(struct a4l_device_context * cxt)
>>  {
>> diff --git a/kernel/drivers/can/mscan/rtcan_mscan_proc.c b/kernel/drivers/can/mscan/rtcan_mscan_proc.c
>> index 6b54ad4c7..732a02765 100644
>> --- a/kernel/drivers/can/mscan/rtcan_mscan_proc.c
>> +++ b/kernel/drivers/can/mscan/rtcan_mscan_proc.c
>> @@ -114,12 +114,12 @@ static int rtcan_mscan_proc_regs_open(struct inode *inode, struct file *file)
>>  	return single_open(file, rtcan_mscan_proc_regs, PDE_DATA(inode));
>>  }
>>  
>> -static const struct file_operations rtcan_mscan_proc_regs_ops = {
>> -	.open		= rtcan_mscan_proc_regs_open,
>> -	.read		= seq_read,
>> -	.llseek		= seq_lseek,
>> -	.release	= single_release,
>> -};
>> +static const DEFINE_PROC_OPS(rtcan_mscan_proc_regs_ops,
>> +			rtcan_mscan_proc_regs_open,
>> +			single_elease,
>> +			seq_read,
>> +			NULL
>> +);
>>  
>>  int rtcan_mscan_create_proc(struct rtcan_device* dev)
>>  {
>> diff --git a/kernel/drivers/can/rtcan_module.c b/kernel/drivers/can/rtcan_module.c
>> index fbc5c35e4..7f3d4c395 100644
>> --- a/kernel/drivers/can/rtcan_module.c
>> +++ b/kernel/drivers/can/rtcan_module.c
>> @@ -157,12 +157,12 @@ static int rtcan_proc_devices_open(struct inode *inode, struct file *file)
>>  	return single_open(file, rtcan_read_proc_devices, NULL);
>>  }
>>  
>> -static const struct file_operations rtcan_proc_devices_ops = {
>> -	.open		= rtcan_proc_devices_open,
>> -	.read		= seq_read,
>> -	.llseek		= seq_lseek,
>> -	.release	= single_release,
>> -};
>> +static const DEFINE_PROC_OPS(rtcan_proc_devices_ops,
>> +			rtcan_proc_devices_open,
>> +			single_release,
>> +			seq_read,
>> +			NULL
>> +);
>>  
>>  static int rtcan_read_proc_sockets(struct seq_file *p, void *data)
>>  {
>> @@ -220,13 +220,12 @@ static int rtcan_proc_sockets_open(struct inode *inode, struct file *file)
>>  	return single_open(file, rtcan_read_proc_sockets, NULL);
>>  }
>>  
>> -static const struct file_operations rtcan_proc_sockets_ops = {
>> -	.open		= rtcan_proc_sockets_open,
>> -	.read		= seq_read,
>> -	.llseek		= seq_lseek,
>> -	.release	= single_release,
>> -};
>> -
>> +static const DEFINE_PROC_OPS(rtcan_proc_sockets_ops,
>> +			rtcan_proc_sockets_open,
>> +			single_release,
>> +			seq_read,
>> +			NULL
>> +);
>>  
>>  static int rtcan_read_proc_info(struct seq_file *p, void *data)
>>  {
>> @@ -271,14 +270,12 @@ static int rtcan_proc_info_open(struct inode *inode, struct file *file)
>>  	return single_open(file, rtcan_read_proc_info, PDE_DATA(inode));
>>  }
>>  
>> -static const struct file_operations rtcan_proc_info_ops = {
>> -	.open		= rtcan_proc_info_open,
>> -	.read		= seq_read,
>> -	.llseek		= seq_lseek,
>> -	.release	= single_release,
>> -};
>> -
>> -
>> +static const DEFINE_PROC_OPS(rtcan_proc_info_ops,
>> +			rtcan_proc_info_open,
>> +			single_release,
>> +			seq_read,
>> +			NULL
>> +);
>>  
>>  static int rtcan_read_proc_filter(struct seq_file *p, void *data)
>>  {
>> @@ -319,14 +316,12 @@ static int rtcan_proc_filter_open(struct inode *inode, struct file *file)
>>  	return single_open(file, rtcan_read_proc_filter, PDE_DATA(inode));
>>  }
>>  
>> -static const struct file_operations rtcan_proc_filter_ops = {
>> -	.open		= rtcan_proc_filter_open,
>> -	.read		= seq_read,
>> -	.llseek		= seq_lseek,
>> -	.release	= single_release,
>> -};
>> -
>> -
>> +static const DEFINE_PROC_OPS(rtcan_proc_filter_ops,
>> +			rtcan_proc_filter_open,
>> +			single_release,
>> +			seq_read,
>> +			NULL
>> +);
>>  
>>  static int rtcan_read_proc_version(struct seq_file *p, void *data)
>>  {
>> @@ -341,13 +336,12 @@ static int rtcan_proc_version_open(struct inode *inode, struct file *file)
>>  	return single_open(file, rtcan_read_proc_version, NULL);
>>  }
>>  
>> -static const struct file_operations rtcan_proc_version_ops = {
>> -	.open		= rtcan_proc_version_open,
>> -	.read		= seq_read,
>> -	.llseek		= seq_lseek,
>> -	.release	= single_release,
>> -};
>> -
>> +static const DEFINE_PROC_OPS(rtcan_proc_version_ops,
>> +			rtcan_proc_version_open,
>> +			single_release,
>> +			seq_read,
>> +			NULL
>> +);
>>  
>>  void rtcan_dev_remove_proc(struct rtcan_device* dev)
>>  {
>> diff --git a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
>> index b4af8ab2e..0fdee8c37 100644
>> --- a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
>> +++ b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
>> @@ -50,12 +50,12 @@ static int rtcan_sja_proc_regs_open(struct inode *inode, struct file *file)
>>  	return single_open(file, rtcan_sja_proc_regs, PDE_DATA(inode));
>>  }
>>  
>> -static const struct file_operations rtcan_sja_proc_regs_ops = {
>> -	.open		= rtcan_sja_proc_regs_open,
>> -	.read		= seq_read,
>> -	.llseek		= seq_lseek,
>> -	.release	= single_release,
>> -};
>> +static const DEFINE_PROC_OPS(rtcan_sja_proc_regs_ops,
>> +			rtcan_sja_proc_regs_open,
>> +			single_release,
>> +			seq_read,
>> +			NULL
>> +);
>>  
>>  int rtcan_sja_create_proc(struct rtcan_device* dev)
>>  {
>> 
>
> This leaves out a4l_proc_devs_ops and a4l_proc_drvs_ops, while
> converting a4l_proc_transfer_ops. By accident? In any case, I've added
> those two cases to the patch.
>

Not by accident, Analogy is off the map as far as I'm concerned. I would
simply disable it from the CI stuff, until a patch dropping it entirely
is pushed upstream.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 6/7] drivers/net: icmp: remove variable-length array
  2021-03-27 10:19 ` [PATCH v2 6/7] drivers/net: icmp: remove variable-length array Philippe Gerum
@ 2021-04-07 10:24   ` Jan Kiszka
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Kiszka @ 2021-04-07 10:24 UTC (permalink / raw)
  To: Philippe Gerum, xenomai

On 27.03.21 11:19, Philippe Gerum wrote:
> From: Philippe Gerum <rpm@xenomai.org>
> 
> Signed-off-by: Philippe Gerum <rpm@xenomai.org>
> ---
>  kernel/drivers/net/stack/ipv4/icmp.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/drivers/net/stack/ipv4/icmp.c b/kernel/drivers/net/stack/ipv4/icmp.c
> index 0671cbc93..c3563595e 100644
> --- a/kernel/drivers/net/stack/ipv4/icmp.c
> +++ b/kernel/drivers/net/stack/ipv4/icmp.c
> @@ -313,8 +313,17 @@ static int rt_icmp_send_request(u32 daddr, struct icmp_bxm *icmp_param)
>  int rt_icmp_send_echo(u32 daddr, u16 id, u16 sequence, size_t msg_size)
>  {
>  	struct icmp_bxm icmp_param;
> -	unsigned char pattern_buf[msg_size];
> +	unsigned char *pattern_buf;
>  	off_t pos;
> +	int ret;
> +
> +	/*
> +	 * This is ping, exec time is not that critical, so
> +	 * rtdm_malloc() is ok for the purpose of echoing.

Solution is fine but comment is misleading: This is the submission path,
not the reflection. We are setting up the ping message here, and this is
indeed not time-critical. I'll fix that up on merge.

> +	 */
> +	pattern_buf = rtdm_malloc(msg_size);
> +	if (pattern_buf == NULL)
> +		return -ENOMEM;
>  
>  	/* first purge any potentially pending ICMP fragments */
>  	rt_ip_frag_invalidate_socket(icmp_socket);
> @@ -343,7 +352,10 @@ int rt_icmp_send_echo(u32 daddr, u16 id, u16 sequence, size_t msg_size)
>  	}
>  	icmp_param.data.buf = pattern_buf;
>  
> -	return rt_icmp_send_request(daddr, &icmp_param);
> +	ret = rt_icmp_send_request(daddr, &icmp_param);
> +	rtdm_free(pattern_buf);
> +
> +	return ret;
>  }
>  
>  /***
> 

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 7/7] drivers/net: cfg: fix config file load up
  2021-03-27 10:19 ` [PATCH v2 7/7] drivers/net: cfg: fix config file load up Philippe Gerum
@ 2021-04-07 10:35   ` Jan Kiszka
  2021-05-04 17:18     ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kiszka @ 2021-04-07 10:35 UTC (permalink / raw)
  To: Philippe Gerum, xenomai

On 27.03.21 11:19, Philippe Gerum wrote:
> From: Philippe Gerum <rpm@xenomai.org>
> 
> set_fs() is on its way out, so we cannot open code a file read
> operation by calling the VFS handler directly anymore, faking a user
> address space.
> 
> We do have kernel interfaces for loading files though, particularly
> kernel_read_file(). So let's use that one for loading the
> configuration file contents. Unfortunately, the signature of this
> service changed during the 5.9-rc cycle, so we have to resort to an
> ugly wrapper to cope with all supported kernels once again. Sigh.
> 
> Signed-off-by: Philippe Gerum <rpm@xenomai.org>
> ---
>  .../include/asm-generic/xenomai/wrappers.h    | 15 ++++
>  kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c  | 74 +++++++++----------
>  2 files changed, 52 insertions(+), 37 deletions(-)
> 
> diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
> index cfd28fc47..f15fe4389 100644
> --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
> +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
> @@ -220,4 +220,19 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name,
>  	})
>  #endif
>  
> +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0)
> +#define read_file_from_kernel(__file, __buf, __buf_size, __file_size, __id) \
> +	({								\
> +		loff_t ___file_size;					\
> +		int __ret;						\
> +		__ret = kernel_read_file(__file, __buf, &___file_size,	\
> +				__buf_size, __id);			\
> +		(*__file_size) = ___file_size;				\
> +		__ret;							\
> +	})
> +#else
> +#define read_file_from_kernel(__file, __buf, __buf_size, __file_size, __id) \
> +	kernel_read_file(__file, 0, __buf, __buf_size, __file_size, __id)
> +#endif
> +
>  #endif /* _COBALT_ASM_GENERIC_WRAPPERS_H */
> diff --git a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c
> index 769b4e143..e460571c2 100644
> --- a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c
> +++ b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c
> @@ -22,6 +22,7 @@
>   *
>   */
>  
> +#include <linux/fs.h>
>  #include <linux/file.h>
>  #include <linux/vmalloc.h>
>  
> @@ -196,6 +197,40 @@ void cleanup_cmd_detach(void *priv_data)
>  		kfree_rtskb(cmd->args.detach.stage2_chain);
>  }
>  
> +static int load_cfg_file(struct rtcfg_file *cfgfile, struct rtcfg_cmd *cmd)
> +{
> +	size_t file_size = 0;
> +	struct file *filp;
> +	loff_t i_size;
> +	int ret;
> +
> +	filp = filp_open(cfgfile->name, O_RDONLY, 0);
> +	if (IS_ERR(filp))
> +		return PTR_ERR(filp);
> +
> +	i_size = i_size_read(file_inode(filp));
> +	if (i_size <= 0) {
> +		/* allocate buffer even for empty files */
> +		cfgfile->buffer = vmalloc(1);
> +	} else {
> +		cfgfile->buffer = NULL;
> +		/* Assume 1Mb should be enough for a config file... */

This limitation is new, and I'm not sure if that would be a good idea.
But I need to read up the protocol details again.

Why do we need that limit? We know i_size already, no?

Jan

> +		ret = read_file_from_kernel(filp, &cfgfile->buffer,
> +				1UL << 30, &file_size, READING_UNKNOWN);
> +		if (ret < 0) {
> +			fput(filp);
> +			return ret;
> +		}
> +	}
> +
> +	fput(filp);
> +	cfgfile->size = file_size;
> +
> +	/* dispatch again, this time with new file attached */
> +	return rtpc_dispatch_call(rtcfg_event_handler, 0, cmd,
> +				sizeof(*cmd), NULL, cleanup_cmd_add);
> +}
> +
>  int rtcfg_ioctl_add(struct rtnet_device *rtdev, struct rtcfg_cmd *cmd)
>  {
>  	struct rtcfg_connection *conn_buf;
> @@ -264,46 +299,11 @@ int rtcfg_ioctl_add(struct rtnet_device *rtdev, struct rtcfg_cmd *cmd)
>  
>  	/* load file if missing */
>  	if (ret > 0) {
> -		struct file *filp;
> -		mm_segment_t oldfs;
> -
> -		filp = filp_open(file->name, O_RDONLY, 0);
> -		if (IS_ERR(filp)) {
> +		ret = load_cfg_file(file, cmd);
> +		if (ret) {
>  			rtcfg_unlockwr_proc(cmd->internal.data.ifindex);
> -			ret = PTR_ERR(filp);
>  			goto err;
>  		}
> -
> -		file->size = filp->f_path.dentry->d_inode->i_size;
> -
> -		/* allocate buffer even for empty files */
> -		file->buffer = vmalloc((file->size) ? file->size : 1);
> -		if (file->buffer == NULL) {
> -			rtcfg_unlockwr_proc(cmd->internal.data.ifindex);
> -			fput(filp);
> -			ret = -ENOMEM;
> -			goto err;
> -		}
> -
> -		oldfs = get_fs();
> -		set_fs(KERNEL_DS);
> -		filp->f_pos = 0;
> -
> -		ret = filp->f_op->read(filp, file->buffer, file->size,
> -				       &filp->f_pos);
> -
> -		set_fs(oldfs);
> -		fput(filp);
> -
> -		if (ret != (int)file->size) {
> -			rtcfg_unlockwr_proc(cmd->internal.data.ifindex);
> -			ret = -EIO;
> -			goto err;
> -		}
> -
> -		/* dispatch again, this time with new file attached */
> -		ret = rtpc_dispatch_call(rtcfg_event_handler, 0, cmd,
> -					 sizeof(*cmd), NULL, cleanup_cmd_add);
>  	}
>  
>  	return ret;
> 


-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 3/7] cobalt/kernel: convert to proc_ops
  2021-04-07 10:17     ` Philippe Gerum
@ 2021-04-07 10:37       ` Jan Kiszka
  2021-04-07 11:03         ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kiszka @ 2021-04-07 10:37 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

On 07.04.21 12:17, Philippe Gerum wrote:
> 
> Jan Kiszka <jan.kiszka@siemens.com> writes:
> 
>> On 27.03.21 11:19, Philippe Gerum wrote:
>>> From: Philippe Gerum <rpm@xenomai.org>
>>>
>>> Signed-off-by: Philippe Gerum <rpm@xenomai.org>
>>> ---
>>>  .../include/asm-generic/xenomai/wrappers.h    | 20 ++++++
>>>  kernel/cobalt/vfile.c                         | 26 ++++----
>>>  kernel/drivers/analogy/device.c               | 12 ++--
>>>  kernel/drivers/can/mscan/rtcan_mscan_proc.c   | 12 ++--
>>>  kernel/drivers/can/rtcan_module.c             | 66 +++++++++----------
>>>  .../drivers/can/sja1000/rtcan_sja1000_proc.c  | 12 ++--
>>>  6 files changed, 80 insertions(+), 68 deletions(-)
>>>
>>> diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
>>> index cd22a8db5..cc0cb0896 100644
>>> --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
>>> +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
>>> @@ -170,6 +170,26 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name,
>>>  #define __kernel_old_timeval	timeval
>>>  #endif
>>>  
>>> +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
>>> +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \
>>> +	struct file_operations __name = {			    \
>>> +		.open = (__open),				    \
>>> +		.release = (__release),				    \
>>> +		.read = (__read),				    \
>>> +		.write = (__write),				    \
>>> +		.llseek = seq_lseek,				    \
>>> +}
>>> +#else
>>> +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write)	\
>>> +	struct proc_ops __name = {					\
>>> +		.proc_open = (__open),					\
>>> +		.proc_release = (__release),				\
>>> +		.proc_read = (__read),					\
>>> +		.proc_write = (__write),				\
>>> +		.proc_lseek = seq_lseek,				\
>>> +}
>>> +#endif
>>> +
>>>  #if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0)
>>>  #define old_timespec32    compat_timespec
>>>  #define old_itimerspec32  compat_itimerspec
>>> diff --git a/kernel/cobalt/vfile.c b/kernel/cobalt/vfile.c
>>> index f65d46ddf..e9e10ce8d 100644
>>> --- a/kernel/cobalt/vfile.c
>>> +++ b/kernel/cobalt/vfile.c
>>> @@ -340,13 +340,12 @@ ssize_t vfile_snapshot_write(struct file *file, const char __user *buf,
>>>  	return ret;
>>>  }
>>>  
>>> -static struct file_operations vfile_snapshot_fops = {
>>> -	.open = vfile_snapshot_open,
>>> -	.read = seq_read,
>>> -	.write = vfile_snapshot_write,
>>> -	.llseek = seq_lseek,
>>> -	.release = vfile_snapshot_release,
>>> -};
>>> +static const DEFINE_PROC_OPS(vfile_snapshot_fops,
>>> +		       vfile_snapshot_open,
>>> +		       vfile_snapshot_release,
>>> +		       seq_read,
>>> +		       vfile_snapshot_write
>>> +);
>>>  
>>>  /**
>>>   * @fn int xnvfile_init_snapshot(const char *name, struct xnvfile_snapshot *vfile, struct xnvfile_directory *parent)
>>> @@ -592,13 +591,12 @@ ssize_t vfile_regular_write(struct file *file, const char __user *buf,
>>>  	return ret;
>>>  }
>>>  
>>> -static struct file_operations vfile_regular_fops = {
>>> -	.open = vfile_regular_open,
>>> -	.read = seq_read,
>>> -	.write = vfile_regular_write,
>>> -	.llseek = seq_lseek,
>>> -	.release = vfile_regular_release,
>>> -};
>>> +static const DEFINE_PROC_OPS(vfile_regular_fops,
>>> +		       vfile_regular_open,
>>> +		       vfile_regular_release,
>>> +		       seq_read,
>>> +		       vfile_regular_write
>>> +);
>>>  
>>>  /**
>>>   * @fn int xnvfile_init_regular(const char *name, struct xnvfile_regular *vfile, struct xnvfile_directory *parent)
>>> diff --git a/kernel/drivers/analogy/device.c b/kernel/drivers/analogy/device.c
>>> index 160dcf547..6ed588708 100644
>>> --- a/kernel/drivers/analogy/device.c
>>> +++ b/kernel/drivers/analogy/device.c
>>> @@ -95,12 +95,12 @@ static int a4l_proc_transfer_open(struct inode *inode, struct file *file)
>>>  	return single_open(file, a4l_rdproc_transfer, PDE_DATA(inode));
>>>  }
>>>  
>>> -static const struct file_operations a4l_proc_transfer_ops = {
>>> -	.open		= a4l_proc_transfer_open,
>>> -	.read		= seq_read,
>>> -	.llseek		= seq_lseek,
>>> -	.release	= single_release,
>>> -};
>>> +static const DEFINE_PROC_OPS(a4l_proc_transfer_ops,
>>> +			     a4l_proc_transfer_open,
>>> +			     single_release,
>>> +			     seq_read,
>>> +			     NULL
>>> +);
>>>  
>>>  int a4l_proc_attach(struct a4l_device_context * cxt)
>>>  {
>>> diff --git a/kernel/drivers/can/mscan/rtcan_mscan_proc.c b/kernel/drivers/can/mscan/rtcan_mscan_proc.c
>>> index 6b54ad4c7..732a02765 100644
>>> --- a/kernel/drivers/can/mscan/rtcan_mscan_proc.c
>>> +++ b/kernel/drivers/can/mscan/rtcan_mscan_proc.c
>>> @@ -114,12 +114,12 @@ static int rtcan_mscan_proc_regs_open(struct inode *inode, struct file *file)
>>>  	return single_open(file, rtcan_mscan_proc_regs, PDE_DATA(inode));
>>>  }
>>>  
>>> -static const struct file_operations rtcan_mscan_proc_regs_ops = {
>>> -	.open		= rtcan_mscan_proc_regs_open,
>>> -	.read		= seq_read,
>>> -	.llseek		= seq_lseek,
>>> -	.release	= single_release,
>>> -};
>>> +static const DEFINE_PROC_OPS(rtcan_mscan_proc_regs_ops,
>>> +			rtcan_mscan_proc_regs_open,
>>> +			single_elease,
>>> +			seq_read,
>>> +			NULL
>>> +);
>>>  
>>>  int rtcan_mscan_create_proc(struct rtcan_device* dev)
>>>  {
>>> diff --git a/kernel/drivers/can/rtcan_module.c b/kernel/drivers/can/rtcan_module.c
>>> index fbc5c35e4..7f3d4c395 100644
>>> --- a/kernel/drivers/can/rtcan_module.c
>>> +++ b/kernel/drivers/can/rtcan_module.c
>>> @@ -157,12 +157,12 @@ static int rtcan_proc_devices_open(struct inode *inode, struct file *file)
>>>  	return single_open(file, rtcan_read_proc_devices, NULL);
>>>  }
>>>  
>>> -static const struct file_operations rtcan_proc_devices_ops = {
>>> -	.open		= rtcan_proc_devices_open,
>>> -	.read		= seq_read,
>>> -	.llseek		= seq_lseek,
>>> -	.release	= single_release,
>>> -};
>>> +static const DEFINE_PROC_OPS(rtcan_proc_devices_ops,
>>> +			rtcan_proc_devices_open,
>>> +			single_release,
>>> +			seq_read,
>>> +			NULL
>>> +);
>>>  
>>>  static int rtcan_read_proc_sockets(struct seq_file *p, void *data)
>>>  {
>>> @@ -220,13 +220,12 @@ static int rtcan_proc_sockets_open(struct inode *inode, struct file *file)
>>>  	return single_open(file, rtcan_read_proc_sockets, NULL);
>>>  }
>>>  
>>> -static const struct file_operations rtcan_proc_sockets_ops = {
>>> -	.open		= rtcan_proc_sockets_open,
>>> -	.read		= seq_read,
>>> -	.llseek		= seq_lseek,
>>> -	.release	= single_release,
>>> -};
>>> -
>>> +static const DEFINE_PROC_OPS(rtcan_proc_sockets_ops,
>>> +			rtcan_proc_sockets_open,
>>> +			single_release,
>>> +			seq_read,
>>> +			NULL
>>> +);
>>>  
>>>  static int rtcan_read_proc_info(struct seq_file *p, void *data)
>>>  {
>>> @@ -271,14 +270,12 @@ static int rtcan_proc_info_open(struct inode *inode, struct file *file)
>>>  	return single_open(file, rtcan_read_proc_info, PDE_DATA(inode));
>>>  }
>>>  
>>> -static const struct file_operations rtcan_proc_info_ops = {
>>> -	.open		= rtcan_proc_info_open,
>>> -	.read		= seq_read,
>>> -	.llseek		= seq_lseek,
>>> -	.release	= single_release,
>>> -};
>>> -
>>> -
>>> +static const DEFINE_PROC_OPS(rtcan_proc_info_ops,
>>> +			rtcan_proc_info_open,
>>> +			single_release,
>>> +			seq_read,
>>> +			NULL
>>> +);
>>>  
>>>  static int rtcan_read_proc_filter(struct seq_file *p, void *data)
>>>  {
>>> @@ -319,14 +316,12 @@ static int rtcan_proc_filter_open(struct inode *inode, struct file *file)
>>>  	return single_open(file, rtcan_read_proc_filter, PDE_DATA(inode));
>>>  }
>>>  
>>> -static const struct file_operations rtcan_proc_filter_ops = {
>>> -	.open		= rtcan_proc_filter_open,
>>> -	.read		= seq_read,
>>> -	.llseek		= seq_lseek,
>>> -	.release	= single_release,
>>> -};
>>> -
>>> -
>>> +static const DEFINE_PROC_OPS(rtcan_proc_filter_ops,
>>> +			rtcan_proc_filter_open,
>>> +			single_release,
>>> +			seq_read,
>>> +			NULL
>>> +);
>>>  
>>>  static int rtcan_read_proc_version(struct seq_file *p, void *data)
>>>  {
>>> @@ -341,13 +336,12 @@ static int rtcan_proc_version_open(struct inode *inode, struct file *file)
>>>  	return single_open(file, rtcan_read_proc_version, NULL);
>>>  }
>>>  
>>> -static const struct file_operations rtcan_proc_version_ops = {
>>> -	.open		= rtcan_proc_version_open,
>>> -	.read		= seq_read,
>>> -	.llseek		= seq_lseek,
>>> -	.release	= single_release,
>>> -};
>>> -
>>> +static const DEFINE_PROC_OPS(rtcan_proc_version_ops,
>>> +			rtcan_proc_version_open,
>>> +			single_release,
>>> +			seq_read,
>>> +			NULL
>>> +);
>>>  
>>>  void rtcan_dev_remove_proc(struct rtcan_device* dev)
>>>  {
>>> diff --git a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
>>> index b4af8ab2e..0fdee8c37 100644
>>> --- a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
>>> +++ b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
>>> @@ -50,12 +50,12 @@ static int rtcan_sja_proc_regs_open(struct inode *inode, struct file *file)
>>>  	return single_open(file, rtcan_sja_proc_regs, PDE_DATA(inode));
>>>  }
>>>  
>>> -static const struct file_operations rtcan_sja_proc_regs_ops = {
>>> -	.open		= rtcan_sja_proc_regs_open,
>>> -	.read		= seq_read,
>>> -	.llseek		= seq_lseek,
>>> -	.release	= single_release,
>>> -};
>>> +static const DEFINE_PROC_OPS(rtcan_sja_proc_regs_ops,
>>> +			rtcan_sja_proc_regs_open,
>>> +			single_release,
>>> +			seq_read,
>>> +			NULL
>>> +);
>>>  
>>>  int rtcan_sja_create_proc(struct rtcan_device* dev)
>>>  {
>>>
>>
>> This leaves out a4l_proc_devs_ops and a4l_proc_drvs_ops, while
>> converting a4l_proc_transfer_ops. By accident? In any case, I've added
>> those two cases to the patch.
>>
> 
> Not by accident, Analogy is off the map as far as I'm concerned. I would
> simply disable it from the CI stuff, until a patch dropping it entirely
> is pushed upstream.
> 

Well, touch it or leave it, but not half-convert it.

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 0/7] assorted v5.x related fixups
  2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum
                   ` (6 preceding siblings ...)
  2021-03-27 10:19 ` [PATCH v2 7/7] drivers/net: cfg: fix config file load up Philippe Gerum
@ 2021-04-07 10:39 ` Jan Kiszka
  7 siblings, 0 replies; 27+ messages in thread
From: Jan Kiszka @ 2021-04-07 10:39 UTC (permalink / raw)
  To: Philippe Gerum, xenomai

On 27.03.21 11:19, Philippe Gerum wrote:
> From: Philippe Gerum <rpm@xenomai.org>
> 
> More wrappers and other changes required to build for the kernel v5.x
> series.
> 
> Philippe Gerum (7):
>   cobalt/memory: fix __vmalloc() calls
>   cobalt/debug: switch to mmap_lock interface
>   cobalt/kernel: convert to proc_ops
>   cobalt/debug: prefer dump_stack() to show_stack()
>   drivers/net: wrap csum_partial_copy_nocheck()
>   drivers/net: icmp: remove variable-length array
>   drivers/net: cfg: fix config file load up
> 
>  kernel/cobalt/debug.c                         |  6 +-
>  kernel/cobalt/heap.c                          |  3 +-
>  .../include/asm-generic/xenomai/wrappers.h    | 59 +++++++++++++++
>  kernel/cobalt/posix/memory.c                  |  7 +-
>  kernel/cobalt/vfile.c                         | 26 +++----
>  kernel/drivers/analogy/device.c               | 12 +--
>  kernel/drivers/can/mscan/rtcan_mscan_proc.c   | 12 +--
>  kernel/drivers/can/rtcan_module.c             | 66 ++++++++---------
>  .../drivers/can/sja1000/rtcan_sja1000_proc.c  | 12 +--
>  kernel/drivers/net/stack/ipv4/icmp.c          | 34 ++++++---
>  kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c  | 74 +++++++++----------
>  kernel/drivers/net/stack/rtskb.c              |  3 +-
>  12 files changed, 189 insertions(+), 125 deletions(-)
> 

Applied 1-4 and 6. 5 and 7 may need further work.

Thanks,
Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 3/7] cobalt/kernel: convert to proc_ops
  2021-04-07 10:37       ` Jan Kiszka
@ 2021-04-07 11:03         ` Philippe Gerum
  0 siblings, 0 replies; 27+ messages in thread
From: Philippe Gerum @ 2021-04-07 11:03 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai


Jan Kiszka <jan.kiszka@siemens.com> writes:

> On 07.04.21 12:17, Philippe Gerum wrote:
>> 
>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>> 
>>> On 27.03.21 11:19, Philippe Gerum wrote:
>>>> From: Philippe Gerum <rpm@xenomai.org>
>>>>
>>>> Signed-off-by: Philippe Gerum <rpm@xenomai.org>
>>>> ---
>>>>  .../include/asm-generic/xenomai/wrappers.h    | 20 ++++++
>>>>  kernel/cobalt/vfile.c                         | 26 ++++----
>>>>  kernel/drivers/analogy/device.c               | 12 ++--
>>>>  kernel/drivers/can/mscan/rtcan_mscan_proc.c   | 12 ++--
>>>>  kernel/drivers/can/rtcan_module.c             | 66 +++++++++----------
>>>>  .../drivers/can/sja1000/rtcan_sja1000_proc.c  | 12 ++--
>>>>  6 files changed, 80 insertions(+), 68 deletions(-)
>>>>
>>>> diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
>>>> index cd22a8db5..cc0cb0896 100644
>>>> --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
>>>> +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
>>>> @@ -170,6 +170,26 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name,
>>>>  #define __kernel_old_timeval	timeval
>>>>  #endif
>>>>  
>>>> +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
>>>> +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \
>>>> +	struct file_operations __name = {			    \
>>>> +		.open = (__open),				    \
>>>> +		.release = (__release),				    \
>>>> +		.read = (__read),				    \
>>>> +		.write = (__write),				    \
>>>> +		.llseek = seq_lseek,				    \
>>>> +}
>>>> +#else
>>>> +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write)	\
>>>> +	struct proc_ops __name = {					\
>>>> +		.proc_open = (__open),					\
>>>> +		.proc_release = (__release),				\
>>>> +		.proc_read = (__read),					\
>>>> +		.proc_write = (__write),				\
>>>> +		.proc_lseek = seq_lseek,				\
>>>> +}
>>>> +#endif
>>>> +
>>>>  #if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0)
>>>>  #define old_timespec32    compat_timespec
>>>>  #define old_itimerspec32  compat_itimerspec
>>>> diff --git a/kernel/cobalt/vfile.c b/kernel/cobalt/vfile.c
>>>> index f65d46ddf..e9e10ce8d 100644
>>>> --- a/kernel/cobalt/vfile.c
>>>> +++ b/kernel/cobalt/vfile.c
>>>> @@ -340,13 +340,12 @@ ssize_t vfile_snapshot_write(struct file *file, const char __user *buf,
>>>>  	return ret;
>>>>  }
>>>>  
>>>> -static struct file_operations vfile_snapshot_fops = {
>>>> -	.open = vfile_snapshot_open,
>>>> -	.read = seq_read,
>>>> -	.write = vfile_snapshot_write,
>>>> -	.llseek = seq_lseek,
>>>> -	.release = vfile_snapshot_release,
>>>> -};
>>>> +static const DEFINE_PROC_OPS(vfile_snapshot_fops,
>>>> +		       vfile_snapshot_open,
>>>> +		       vfile_snapshot_release,
>>>> +		       seq_read,
>>>> +		       vfile_snapshot_write
>>>> +);
>>>>  
>>>>  /**
>>>>   * @fn int xnvfile_init_snapshot(const char *name, struct xnvfile_snapshot *vfile, struct xnvfile_directory *parent)
>>>> @@ -592,13 +591,12 @@ ssize_t vfile_regular_write(struct file *file, const char __user *buf,
>>>>  	return ret;
>>>>  }
>>>>  
>>>> -static struct file_operations vfile_regular_fops = {
>>>> -	.open = vfile_regular_open,
>>>> -	.read = seq_read,
>>>> -	.write = vfile_regular_write,
>>>> -	.llseek = seq_lseek,
>>>> -	.release = vfile_regular_release,
>>>> -};
>>>> +static const DEFINE_PROC_OPS(vfile_regular_fops,
>>>> +		       vfile_regular_open,
>>>> +		       vfile_regular_release,
>>>> +		       seq_read,
>>>> +		       vfile_regular_write
>>>> +);
>>>>  
>>>>  /**
>>>>   * @fn int xnvfile_init_regular(const char *name, struct xnvfile_regular *vfile, struct xnvfile_directory *parent)
>>>> diff --git a/kernel/drivers/analogy/device.c b/kernel/drivers/analogy/device.c
>>>> index 160dcf547..6ed588708 100644
>>>> --- a/kernel/drivers/analogy/device.c
>>>> +++ b/kernel/drivers/analogy/device.c
>>>> @@ -95,12 +95,12 @@ static int a4l_proc_transfer_open(struct inode *inode, struct file *file)
>>>>  	return single_open(file, a4l_rdproc_transfer, PDE_DATA(inode));
>>>>  }
>>>>  
>>>> -static const struct file_operations a4l_proc_transfer_ops = {
>>>> -	.open		= a4l_proc_transfer_open,
>>>> -	.read		= seq_read,
>>>> -	.llseek		= seq_lseek,
>>>> -	.release	= single_release,
>>>> -};
>>>> +static const DEFINE_PROC_OPS(a4l_proc_transfer_ops,
>>>> +			     a4l_proc_transfer_open,
>>>> +			     single_release,
>>>> +			     seq_read,
>>>> +			     NULL
>>>> +);
>>>>  
>>>>  int a4l_proc_attach(struct a4l_device_context * cxt)
>>>>  {
>>>> diff --git a/kernel/drivers/can/mscan/rtcan_mscan_proc.c b/kernel/drivers/can/mscan/rtcan_mscan_proc.c
>>>> index 6b54ad4c7..732a02765 100644
>>>> --- a/kernel/drivers/can/mscan/rtcan_mscan_proc.c
>>>> +++ b/kernel/drivers/can/mscan/rtcan_mscan_proc.c
>>>> @@ -114,12 +114,12 @@ static int rtcan_mscan_proc_regs_open(struct inode *inode, struct file *file)
>>>>  	return single_open(file, rtcan_mscan_proc_regs, PDE_DATA(inode));
>>>>  }
>>>>  
>>>> -static const struct file_operations rtcan_mscan_proc_regs_ops = {
>>>> -	.open		= rtcan_mscan_proc_regs_open,
>>>> -	.read		= seq_read,
>>>> -	.llseek		= seq_lseek,
>>>> -	.release	= single_release,
>>>> -};
>>>> +static const DEFINE_PROC_OPS(rtcan_mscan_proc_regs_ops,
>>>> +			rtcan_mscan_proc_regs_open,
>>>> +			single_elease,
>>>> +			seq_read,
>>>> +			NULL
>>>> +);
>>>>  
>>>>  int rtcan_mscan_create_proc(struct rtcan_device* dev)
>>>>  {
>>>> diff --git a/kernel/drivers/can/rtcan_module.c b/kernel/drivers/can/rtcan_module.c
>>>> index fbc5c35e4..7f3d4c395 100644
>>>> --- a/kernel/drivers/can/rtcan_module.c
>>>> +++ b/kernel/drivers/can/rtcan_module.c
>>>> @@ -157,12 +157,12 @@ static int rtcan_proc_devices_open(struct inode *inode, struct file *file)
>>>>  	return single_open(file, rtcan_read_proc_devices, NULL);
>>>>  }
>>>>  
>>>> -static const struct file_operations rtcan_proc_devices_ops = {
>>>> -	.open		= rtcan_proc_devices_open,
>>>> -	.read		= seq_read,
>>>> -	.llseek		= seq_lseek,
>>>> -	.release	= single_release,
>>>> -};
>>>> +static const DEFINE_PROC_OPS(rtcan_proc_devices_ops,
>>>> +			rtcan_proc_devices_open,
>>>> +			single_release,
>>>> +			seq_read,
>>>> +			NULL
>>>> +);
>>>>  
>>>>  static int rtcan_read_proc_sockets(struct seq_file *p, void *data)
>>>>  {
>>>> @@ -220,13 +220,12 @@ static int rtcan_proc_sockets_open(struct inode *inode, struct file *file)
>>>>  	return single_open(file, rtcan_read_proc_sockets, NULL);
>>>>  }
>>>>  
>>>> -static const struct file_operations rtcan_proc_sockets_ops = {
>>>> -	.open		= rtcan_proc_sockets_open,
>>>> -	.read		= seq_read,
>>>> -	.llseek		= seq_lseek,
>>>> -	.release	= single_release,
>>>> -};
>>>> -
>>>> +static const DEFINE_PROC_OPS(rtcan_proc_sockets_ops,
>>>> +			rtcan_proc_sockets_open,
>>>> +			single_release,
>>>> +			seq_read,
>>>> +			NULL
>>>> +);
>>>>  
>>>>  static int rtcan_read_proc_info(struct seq_file *p, void *data)
>>>>  {
>>>> @@ -271,14 +270,12 @@ static int rtcan_proc_info_open(struct inode *inode, struct file *file)
>>>>  	return single_open(file, rtcan_read_proc_info, PDE_DATA(inode));
>>>>  }
>>>>  
>>>> -static const struct file_operations rtcan_proc_info_ops = {
>>>> -	.open		= rtcan_proc_info_open,
>>>> -	.read		= seq_read,
>>>> -	.llseek		= seq_lseek,
>>>> -	.release	= single_release,
>>>> -};
>>>> -
>>>> -
>>>> +static const DEFINE_PROC_OPS(rtcan_proc_info_ops,
>>>> +			rtcan_proc_info_open,
>>>> +			single_release,
>>>> +			seq_read,
>>>> +			NULL
>>>> +);
>>>>  
>>>>  static int rtcan_read_proc_filter(struct seq_file *p, void *data)
>>>>  {
>>>> @@ -319,14 +316,12 @@ static int rtcan_proc_filter_open(struct inode *inode, struct file *file)
>>>>  	return single_open(file, rtcan_read_proc_filter, PDE_DATA(inode));
>>>>  }
>>>>  
>>>> -static const struct file_operations rtcan_proc_filter_ops = {
>>>> -	.open		= rtcan_proc_filter_open,
>>>> -	.read		= seq_read,
>>>> -	.llseek		= seq_lseek,
>>>> -	.release	= single_release,
>>>> -};
>>>> -
>>>> -
>>>> +static const DEFINE_PROC_OPS(rtcan_proc_filter_ops,
>>>> +			rtcan_proc_filter_open,
>>>> +			single_release,
>>>> +			seq_read,
>>>> +			NULL
>>>> +);
>>>>  
>>>>  static int rtcan_read_proc_version(struct seq_file *p, void *data)
>>>>  {
>>>> @@ -341,13 +336,12 @@ static int rtcan_proc_version_open(struct inode *inode, struct file *file)
>>>>  	return single_open(file, rtcan_read_proc_version, NULL);
>>>>  }
>>>>  
>>>> -static const struct file_operations rtcan_proc_version_ops = {
>>>> -	.open		= rtcan_proc_version_open,
>>>> -	.read		= seq_read,
>>>> -	.llseek		= seq_lseek,
>>>> -	.release	= single_release,
>>>> -};
>>>> -
>>>> +static const DEFINE_PROC_OPS(rtcan_proc_version_ops,
>>>> +			rtcan_proc_version_open,
>>>> +			single_release,
>>>> +			seq_read,
>>>> +			NULL
>>>> +);
>>>>  
>>>>  void rtcan_dev_remove_proc(struct rtcan_device* dev)
>>>>  {
>>>> diff --git a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
>>>> index b4af8ab2e..0fdee8c37 100644
>>>> --- a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
>>>> +++ b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c
>>>> @@ -50,12 +50,12 @@ static int rtcan_sja_proc_regs_open(struct inode *inode, struct file *file)
>>>>  	return single_open(file, rtcan_sja_proc_regs, PDE_DATA(inode));
>>>>  }
>>>>  
>>>> -static const struct file_operations rtcan_sja_proc_regs_ops = {
>>>> -	.open		= rtcan_sja_proc_regs_open,
>>>> -	.read		= seq_read,
>>>> -	.llseek		= seq_lseek,
>>>> -	.release	= single_release,
>>>> -};
>>>> +static const DEFINE_PROC_OPS(rtcan_sja_proc_regs_ops,
>>>> +			rtcan_sja_proc_regs_open,
>>>> +			single_release,
>>>> +			seq_read,
>>>> +			NULL
>>>> +);
>>>>  
>>>>  int rtcan_sja_create_proc(struct rtcan_device* dev)
>>>>  {
>>>>
>>>
>>> This leaves out a4l_proc_devs_ops and a4l_proc_drvs_ops, while
>>> converting a4l_proc_transfer_ops. By accident? In any case, I've added
>>> those two cases to the patch.
>>>
>> 
>> Not by accident, Analogy is off the map as far as I'm concerned. I would
>> simply disable it from the CI stuff, until a patch dropping it entirely
>> is pushed upstream.
>> 
>
> Well, touch it or leave it, but not half-convert it.
>

Then let's not fix the unfixable at all. Please drop all changes, then
issue a patch removing Analogy entirely.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck()
  2021-04-07 10:06   ` Jan Kiszka
@ 2021-04-15  7:21     ` Philippe Gerum
  2021-04-15  7:35       ` Jan Kiszka
  0 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2021-04-15  7:21 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai


Jan Kiszka <jan.kiszka@siemens.com> writes:

> On 27.03.21 11:19, Philippe Gerum wrote:
>> From: Philippe Gerum <rpm@xenomai.org>
>> 
>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
>> last argument to csum_partial(). According to #cc44c17baf7f3, passing
>> a non-zero value would not even yield the proper result on some
>> architectures.
>> 
>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>> to be used in the next computation, so let's wrap net_csum_copy() to
>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for
>> later kernels so that we still feed csum_partial() with the user-given
>> csum. We still expect the x86, ARM and arm64 implementations of
>> csum_partial() to do the right thing.
>> 
>
> If that issue only affects the ICMP code path, why not only changing
> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
> and csum in one step?
>

As a result of #cc44c17baf7f3, I see no common helper available from the
kernel folding the copy and checksum operations anymore, so I see no way
to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one
replacement for csum_partial_copy_nocheck() which would allow a csum
value to be fed in?

-- 
Philippe.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck()
  2021-04-15  7:21     ` Philippe Gerum
@ 2021-04-15  7:35       ` Jan Kiszka
  2021-04-15  7:54         ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kiszka @ 2021-04-15  7:35 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

On 15.04.21 09:21, Philippe Gerum wrote:
> 
> Jan Kiszka <jan.kiszka@siemens.com> writes:
> 
>> On 27.03.21 11:19, Philippe Gerum wrote:
>>> From: Philippe Gerum <rpm@xenomai.org>
>>>
>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing
>>> a non-zero value would not even yield the proper result on some
>>> architectures.
>>>
>>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>>> to be used in the next computation, so let's wrap net_csum_copy() to
>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for
>>> later kernels so that we still feed csum_partial() with the user-given
>>> csum. We still expect the x86, ARM and arm64 implementations of
>>> csum_partial() to do the right thing.
>>>
>>
>> If that issue only affects the ICMP code path, why not only changing
>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
>> and csum in one step?
>>
> 
> As a result of #cc44c17baf7f3, I see no common helper available from the
> kernel folding the copy and checksum operations anymore, so I see no way
> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one
> replacement for csum_partial_copy_nocheck() which would allow a csum
> value to be fed in?
> 

rtskb_copy_and_csum_dev does not need that.

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck()
  2021-04-15  7:35       ` Jan Kiszka
@ 2021-04-15  7:54         ` Philippe Gerum
  2021-04-15  8:10           ` Jan Kiszka
  0 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2021-04-15  7:54 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai


Jan Kiszka <jan.kiszka@siemens.com> writes:

> On 15.04.21 09:21, Philippe Gerum wrote:
>> 
>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>> 
>>> On 27.03.21 11:19, Philippe Gerum wrote:
>>>> From: Philippe Gerum <rpm@xenomai.org>
>>>>
>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing
>>>> a non-zero value would not even yield the proper result on some
>>>> architectures.
>>>>
>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>>>> to be used in the next computation, so let's wrap net_csum_copy() to
>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for
>>>> later kernels so that we still feed csum_partial() with the user-given
>>>> csum. We still expect the x86, ARM and arm64 implementations of
>>>> csum_partial() to do the right thing.
>>>>
>>>
>>> If that issue only affects the ICMP code path, why not only changing
>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
>>> and csum in one step?
>>>
>> 
>> As a result of #cc44c17baf7f3, I see no common helper available from the
>> kernel folding the copy and checksum operations anymore, so I see no way
>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one
>> replacement for csum_partial_copy_nocheck() which would allow a csum
>> value to be fed in?
>> 
>
> rtskb_copy_and_csum_dev does not need that.
>

You made rtskb_copy_and_csum_bits() part of the exported API. So how do
you want to deal with this?

-- 
Philippe.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck()
  2021-04-15  7:54         ` Philippe Gerum
@ 2021-04-15  8:10           ` Jan Kiszka
  2021-04-16 16:48             ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kiszka @ 2021-04-15  8:10 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

On 15.04.21 09:54, Philippe Gerum wrote:
> 
> Jan Kiszka <jan.kiszka@siemens.com> writes:
> 
>> On 15.04.21 09:21, Philippe Gerum wrote:
>>>
>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>
>>>> On 27.03.21 11:19, Philippe Gerum wrote:
>>>>> From: Philippe Gerum <rpm@xenomai.org>
>>>>>
>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing
>>>>> a non-zero value would not even yield the proper result on some
>>>>> architectures.
>>>>>
>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>>>>> to be used in the next computation, so let's wrap net_csum_copy() to
>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for
>>>>> later kernels so that we still feed csum_partial() with the user-given
>>>>> csum. We still expect the x86, ARM and arm64 implementations of
>>>>> csum_partial() to do the right thing.
>>>>>
>>>>
>>>> If that issue only affects the ICMP code path, why not only changing
>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
>>>> and csum in one step?
>>>>
>>>
>>> As a result of #cc44c17baf7f3, I see no common helper available from the
>>> kernel folding the copy and checksum operations anymore, so I see no way
>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one
>>> replacement for csum_partial_copy_nocheck() which would allow a csum
>>> value to be fed in?
>>>
>>
>> rtskb_copy_and_csum_dev does not need that.
>>
> 
> You made rtskb_copy_and_csum_bits() part of the exported API. So how do
> you want to deal with this?
> 

That is an internal API, so we don't care.

But even if we converted rtskb_copy_and_csum_bits to work as it used to
(with a csum != 0), there would be not reason to make
rtskb_copy_and_csum_dev pay that price.

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck()
  2021-04-15  8:10           ` Jan Kiszka
@ 2021-04-16 16:48             ` Philippe Gerum
  2021-04-16 17:12               ` Jan Kiszka
  0 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2021-04-16 16:48 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai


Jan Kiszka <jan.kiszka@siemens.com> writes:

> On 15.04.21 09:54, Philippe Gerum wrote:
>> 
>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>> 
>>> On 15.04.21 09:21, Philippe Gerum wrote:
>>>>
>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>
>>>>> On 27.03.21 11:19, Philippe Gerum wrote:
>>>>>> From: Philippe Gerum <rpm@xenomai.org>
>>>>>>
>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
>>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing
>>>>>> a non-zero value would not even yield the proper result on some
>>>>>> architectures.
>>>>>>
>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>>>>>> to be used in the next computation, so let's wrap net_csum_copy() to
>>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for
>>>>>> later kernels so that we still feed csum_partial() with the user-given
>>>>>> csum. We still expect the x86, ARM and arm64 implementations of
>>>>>> csum_partial() to do the right thing.
>>>>>>
>>>>>
>>>>> If that issue only affects the ICMP code path, why not only changing
>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
>>>>> and csum in one step?
>>>>>
>>>>
>>>> As a result of #cc44c17baf7f3, I see no common helper available from the
>>>> kernel folding the copy and checksum operations anymore, so I see no way
>>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one
>>>> replacement for csum_partial_copy_nocheck() which would allow a csum
>>>> value to be fed in?
>>>>
>>>
>>> rtskb_copy_and_csum_dev does not need that.
>>>
>> 
>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do
>> you want to deal with this?
>> 
>
> That is an internal API, so we don't care.
>
> But even if we converted rtskb_copy_and_csum_bits to work as it used to
> (with a csum != 0), there would be not reason to make
> rtskb_copy_and_csum_dev pay that price.
>

Well, there may be a reason to challenge the idea that a folded
copy_and_csum is better for a real-time system than a split memcpy+csum
in the first place. Out of curiosity, I ran a few benchmarks lately on a
few SoCs regarding this, and it turned out that optimizing the data copy
to get the buffer quickly in place before checksumming the result may
well yield much better results with respect to jitter than what
csum_and_copy currently achieves on these SoCs.

Inline csum_and_copy did perform slightly better on average (a couple of
hundreds of nanosecs at best) but with pathological jittery in the worst
case at times. On the contrary, the split memcpy+csum method did not
exhibit such jittery during these tests, not once.

- four SoCs tested (2 x x86, armv7, armv8a)
- test code ran in kernel space (real-time task context,
  out-of-band/primary context)
- csum_partial_copy_nocheck() vs memcpy()+csum_partial()
- 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each
- all buffers (src & dst) aligned on L1_CACHE_BYTES
- each run performed 1000,000 iterations of a given checksum loop, no
  pause in between.
- no concurrent load on the board
- all results in nanoseconds

The worst results obtained are presented here for each SoC.

x86[1]
------

vendor_id	: GenuineIntel
cpu family	: 6
model		: 92
model name	: Intel(R) Atom(TM) Processor E3940 @ 1.60GHz
stepping	: 9
cpu MHz		: 1593.600
cache size	: 1024 KB
cpuid level	: 21
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs

==

CSUM_COPY 32b: min=68, max=653, avg=70
CSUM_COPY 1024b: min=248, max=373, avg=251
CSUM_COPY 1500b: min=344, max=3123, avg=350   <=================
COPY+CSUM 32b: min=101, max=790, avg=103
COPY+CSUM 1024b: min=297, max=397, avg=300
COPY+CSUM 1500b: min=402, max=490, avg=405

==

CSUM_COPY 32b: min=68, max=1420, avg=70
CSUM_COPY 1024b: min=248, max=29706, avg=251   <=================
CSUM_COPY 1500b: min=344, max=792, avg=350
COPY+CSUM 32b: min=101, max=872, avg=103
COPY+CSUM 1024b: min=297, max=776, avg=300
COPY+CSUM 1500b: min=402, max=853, avg=405

==

CSUM_COPY 32b: min=68, max=661, avg=70
CSUM_COPY 1024b: min=248, max=1714, avg=251
CSUM_COPY 1500b: min=344, max=33937, avg=350   <=================
COPY+CSUM 32b: min=101, max=610, avg=103
COPY+CSUM 1024b: min=297, max=605, avg=300
COPY+CSUM 1500b: min=402, max=712, avg=405

x86[2]
------

vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU     E7200  @ 2.53GHz
stepping        : 6
microcode       : 0x60c
cpu MHz         : 1627.113
cache size      : 3072 KB
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm

CSUM_COPY 32b: min=558, max=31010, avg=674     <=================
CSUM_COPY 1024b: min=558, max=2794, avg=745
CSUM_COPY 1500b: min=558, max=2794, avg=841
COPY+CSUM 32b: min=558, max=2794, avg=671
COPY+CSUM 1024b: min=558, max=2794, avg=778
COPY+CSUM 1500b: min=838, max=2794, avg=865

==

CSUM_COPY 32b: min=59, max=532, avg=62
CSUM_COPY 1024b: min=198, max=270, avg=201
CSUM_COPY 1500b: min=288, max=34921, avg=289   <=================
COPY+CSUM 32b: min=53, max=326, avg=56
COPY+CSUM 1024b: min=228, max=461, avg=232
COPY+CSUM 1500b: min=311, max=341, avg=317

==

CSUM_COPY 32b: min=59, max=382, avg=62
CSUM_COPY 1024b: min=198, max=383, avg=201
CSUM_COPY 1500b: min=285, max=21235, avg=289   <=================
COPY+CSUM 32b: min=52, max=300, avg=56
COPY+CSUM 1024b: min=228, max=348, avg=232
COPY+CSUM 1500b: min=311, max=409, avg=317

Cortex A9 quad-core 1.2Ghz (iMX6qp)
-----------------------------------

model name	: ARMv7 Processor rev 10 (v7l)
Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x2
CPU part	: 0xc09
CPU revision	: 10

CSUM_COPY 32b: min=333, max=1334, avg=440
CSUM_COPY 1024b: min=1000, max=2666, avg=1060
CSUM_COPY 1500b: min=1333, max=45333, avg=1357   <=================
COPY+CSUM 32b: min=333, max=1334, avg=476
COPY+CSUM 1024b: min=1000, max=2333, avg=1324
COPY+CSUM 1500b: min=1666, max=2334, avg=1713

==

CSUM_COPY 32b: min=333, max=1334, avg=439
CSUM_COPY 1024b: min=1000, max=46000, avg=1060   <=================
CSUM_COPY 1500b: min=1333, max=5000, avg=1351
COPY+CSUM 32b: min=333, max=1334, avg=476
COPY+CSUM 1024b: min=1000, max=2334, avg=1324
COPY+CSUM 1500b: min=1666, max=2667, avg=1713

==

CSUM_COPY 32b: min=333, max=1666, avg=454
CSUM_COPY 1024b: min=1000, max=2000, avg=1060
CSUM_COPY 1500b: min=1333, max=45000, avg=1348   <=================
COPY+CSUM 32b: min=333, max=1334, avg=454
COPY+CSUM 1024b: min=1000, max=2334, avg=1317
COPY+CSUM 1500b: min=1666, max=6000, avg=1712

Cortex A55 quad-core 2Ghz (Odroid C4)
-------------------------------------

Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd05
CPU revision    : 0


CSUM_COPY 32b: min=125, max=833, avg=140
CSUM_COPY 1024b: min=625, max=41916, avg=673   <=================
CSUM_COPY 1500b: min=875, max=3875, avg=923
COPY+CSUM 32b: min=125, max=458, avg=140
COPY+CSUM 1024b: min=625, max=1166, avg=666
COPY+CSUM 1500b: min=875, max=1167, avg=913

==

CSUM_COPY 32b: min=125, max=1292, avg=139
CSUM_COPY 1024b: min=541, max=48333, avg=555
CSUM_COPY 1500b: min=708, max=3458, avg=740
COPY+CSUM 32b: min=125, max=292, avg=136
COPY+CSUM 1024b: min=541, max=750, avg=556
COPY+CSUM 1500b: min=708, max=834, avg=740

==

CSUM_COPY 32b: min=125, max=833, avg=140
CSUM_COPY 1024b: min=666, max=55667, avg=673   <=================
CSUM_COPY 1500b: min=875, max=4208, avg=913
COPY+CSUM 32b: min=125, max=375, avg=140
COPY+CSUM 1024b: min=666, max=916, avg=673
COPY+CSUM 1500b: min=875, max=1042, avg=913

============

A few additional observations from looking at the implementation:

For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete
buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb
which is faster.

arm32/armv7 optimizes memcpy by loading up to 8 words in a single
instruction. csum_and_copy loads/stores at best 4 words at a time,
only when src and dst are 32bit aligned (which matches the test case).

arm64/armv8a uses load/store pair instructions to copy memory
blocks. It does not have asm-optimized csum_and_copy support, so it
uses the generic C version.

What could be inferred in terms of prefetching and speculation might
explain some differences between the approaches too.

I would be interested in any converging / diverging results testing the
same combo with a different test code, because from my standpoint,
things do not seem as obvious as they are supposed to be at the moment.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck()
  2021-04-16 16:48             ` Philippe Gerum
@ 2021-04-16 17:12               ` Jan Kiszka
  2021-04-18  9:18                 ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kiszka @ 2021-04-16 17:12 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

On 16.04.21 18:48, Philippe Gerum wrote:
> 
> Jan Kiszka <jan.kiszka@siemens.com> writes:
> 
>> On 15.04.21 09:54, Philippe Gerum wrote:
>>>
>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>
>>>> On 15.04.21 09:21, Philippe Gerum wrote:
>>>>>
>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>
>>>>>> On 27.03.21 11:19, Philippe Gerum wrote:
>>>>>>> From: Philippe Gerum <rpm@xenomai.org>
>>>>>>>
>>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
>>>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing
>>>>>>> a non-zero value would not even yield the proper result on some
>>>>>>> architectures.
>>>>>>>
>>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>>>>>>> to be used in the next computation, so let's wrap net_csum_copy() to
>>>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for
>>>>>>> later kernels so that we still feed csum_partial() with the user-given
>>>>>>> csum. We still expect the x86, ARM and arm64 implementations of
>>>>>>> csum_partial() to do the right thing.
>>>>>>>
>>>>>>
>>>>>> If that issue only affects the ICMP code path, why not only changing
>>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
>>>>>> and csum in one step?
>>>>>>
>>>>>
>>>>> As a result of #cc44c17baf7f3, I see no common helper available from the
>>>>> kernel folding the copy and checksum operations anymore, so I see no way
>>>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one
>>>>> replacement for csum_partial_copy_nocheck() which would allow a csum
>>>>> value to be fed in?
>>>>>
>>>>
>>>> rtskb_copy_and_csum_dev does not need that.
>>>>
>>>
>>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do
>>> you want to deal with this?
>>>
>>
>> That is an internal API, so we don't care.
>>
>> But even if we converted rtskb_copy_and_csum_bits to work as it used to
>> (with a csum != 0), there would be not reason to make
>> rtskb_copy_and_csum_dev pay that price.
>>
> 
> Well, there may be a reason to challenge the idea that a folded
> copy_and_csum is better for a real-time system than a split memcpy+csum
> in the first place. Out of curiosity, I ran a few benchmarks lately on a
> few SoCs regarding this, and it turned out that optimizing the data copy
> to get the buffer quickly in place before checksumming the result may
> well yield much better results with respect to jitter than what
> csum_and_copy currently achieves on these SoCs.
> 
> Inline csum_and_copy did perform slightly better on average (a couple of
> hundreds of nanosecs at best) but with pathological jittery in the worst
> case at times. On the contrary, the split memcpy+csum method did not
> exhibit such jittery during these tests, not once.
> 
> - four SoCs tested (2 x x86, armv7, armv8a)
> - test code ran in kernel space (real-time task context,
>   out-of-band/primary context)
> - csum_partial_copy_nocheck() vs memcpy()+csum_partial()
> - 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each
> - all buffers (src & dst) aligned on L1_CACHE_BYTES
> - each run performed 1000,000 iterations of a given checksum loop, no
>   pause in between.
> - no concurrent load on the board
> - all results in nanoseconds
> 
> The worst results obtained are presented here for each SoC.
> 
> x86[1]
> ------
> 
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 92
> model name	: Intel(R) Atom(TM) Processor E3940 @ 1.60GHz
> stepping	: 9
> cpu MHz		: 1593.600
> cache size	: 1024 KB
> cpuid level	: 21
> wp		: yes
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts
> vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs
> 
> ==
> 
> CSUM_COPY 32b: min=68, max=653, avg=70
> CSUM_COPY 1024b: min=248, max=373, avg=251
> CSUM_COPY 1500b: min=344, max=3123, avg=350   <=================
> COPY+CSUM 32b: min=101, max=790, avg=103
> COPY+CSUM 1024b: min=297, max=397, avg=300
> COPY+CSUM 1500b: min=402, max=490, avg=405
> 
> ==
> 
> CSUM_COPY 32b: min=68, max=1420, avg=70
> CSUM_COPY 1024b: min=248, max=29706, avg=251   <=================
> CSUM_COPY 1500b: min=344, max=792, avg=350
> COPY+CSUM 32b: min=101, max=872, avg=103
> COPY+CSUM 1024b: min=297, max=776, avg=300
> COPY+CSUM 1500b: min=402, max=853, avg=405
> 
> ==
> 
> CSUM_COPY 32b: min=68, max=661, avg=70
> CSUM_COPY 1024b: min=248, max=1714, avg=251
> CSUM_COPY 1500b: min=344, max=33937, avg=350   <=================
> COPY+CSUM 32b: min=101, max=610, avg=103
> COPY+CSUM 1024b: min=297, max=605, avg=300
> COPY+CSUM 1500b: min=402, max=712, avg=405
> 
> x86[2]
> ------
> 
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 23
> model name      : Intel(R) Core(TM)2 Duo CPU     E7200  @ 2.53GHz
> stepping        : 6
> microcode       : 0x60c
> cpu MHz         : 1627.113
> cache size      : 3072 KB
> cpuid level     : 10
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm
> 
> CSUM_COPY 32b: min=558, max=31010, avg=674     <=================
> CSUM_COPY 1024b: min=558, max=2794, avg=745
> CSUM_COPY 1500b: min=558, max=2794, avg=841
> COPY+CSUM 32b: min=558, max=2794, avg=671
> COPY+CSUM 1024b: min=558, max=2794, avg=778
> COPY+CSUM 1500b: min=838, max=2794, avg=865
> 
> ==
> 
> CSUM_COPY 32b: min=59, max=532, avg=62
> CSUM_COPY 1024b: min=198, max=270, avg=201
> CSUM_COPY 1500b: min=288, max=34921, avg=289   <=================
> COPY+CSUM 32b: min=53, max=326, avg=56
> COPY+CSUM 1024b: min=228, max=461, avg=232
> COPY+CSUM 1500b: min=311, max=341, avg=317
> 
> ==
> 
> CSUM_COPY 32b: min=59, max=382, avg=62
> CSUM_COPY 1024b: min=198, max=383, avg=201
> CSUM_COPY 1500b: min=285, max=21235, avg=289   <=================
> COPY+CSUM 32b: min=52, max=300, avg=56
> COPY+CSUM 1024b: min=228, max=348, avg=232
> COPY+CSUM 1500b: min=311, max=409, avg=317
> 
> Cortex A9 quad-core 1.2Ghz (iMX6qp)
> -----------------------------------
> 
> model name	: ARMv7 Processor rev 10 (v7l)
> Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
> CPU implementer	: 0x41
> CPU architecture: 7
> CPU variant	: 0x2
> CPU part	: 0xc09
> CPU revision	: 10
> 
> CSUM_COPY 32b: min=333, max=1334, avg=440
> CSUM_COPY 1024b: min=1000, max=2666, avg=1060
> CSUM_COPY 1500b: min=1333, max=45333, avg=1357   <=================
> COPY+CSUM 32b: min=333, max=1334, avg=476
> COPY+CSUM 1024b: min=1000, max=2333, avg=1324
> COPY+CSUM 1500b: min=1666, max=2334, avg=1713
> 
> ==
> 
> CSUM_COPY 32b: min=333, max=1334, avg=439
> CSUM_COPY 1024b: min=1000, max=46000, avg=1060   <=================
> CSUM_COPY 1500b: min=1333, max=5000, avg=1351
> COPY+CSUM 32b: min=333, max=1334, avg=476
> COPY+CSUM 1024b: min=1000, max=2334, avg=1324
> COPY+CSUM 1500b: min=1666, max=2667, avg=1713
> 
> ==
> 
> CSUM_COPY 32b: min=333, max=1666, avg=454
> CSUM_COPY 1024b: min=1000, max=2000, avg=1060
> CSUM_COPY 1500b: min=1333, max=45000, avg=1348   <=================
> COPY+CSUM 32b: min=333, max=1334, avg=454
> COPY+CSUM 1024b: min=1000, max=2334, avg=1317
> COPY+CSUM 1500b: min=1666, max=6000, avg=1712
> 
> Cortex A55 quad-core 2Ghz (Odroid C4)
> -------------------------------------
> 
> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant     : 0x1
> CPU part        : 0xd05
> CPU revision    : 0
> 
> 
> CSUM_COPY 32b: min=125, max=833, avg=140
> CSUM_COPY 1024b: min=625, max=41916, avg=673   <=================
> CSUM_COPY 1500b: min=875, max=3875, avg=923
> COPY+CSUM 32b: min=125, max=458, avg=140
> COPY+CSUM 1024b: min=625, max=1166, avg=666
> COPY+CSUM 1500b: min=875, max=1167, avg=913
> 
> ==
> 
> CSUM_COPY 32b: min=125, max=1292, avg=139
> CSUM_COPY 1024b: min=541, max=48333, avg=555
> CSUM_COPY 1500b: min=708, max=3458, avg=740
> COPY+CSUM 32b: min=125, max=292, avg=136
> COPY+CSUM 1024b: min=541, max=750, avg=556
> COPY+CSUM 1500b: min=708, max=834, avg=740
> 
> ==
> 
> CSUM_COPY 32b: min=125, max=833, avg=140
> CSUM_COPY 1024b: min=666, max=55667, avg=673   <=================
> CSUM_COPY 1500b: min=875, max=4208, avg=913
> COPY+CSUM 32b: min=125, max=375, avg=140
> COPY+CSUM 1024b: min=666, max=916, avg=673
> COPY+CSUM 1500b: min=875, max=1042, avg=913
> 
> ============
> 
> A few additional observations from looking at the implementation:
> 
> For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete
> buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb
> which is faster.
> 
> arm32/armv7 optimizes memcpy by loading up to 8 words in a single
> instruction. csum_and_copy loads/stores at best 4 words at a time,
> only when src and dst are 32bit aligned (which matches the test case).
> 
> arm64/armv8a uses load/store pair instructions to copy memory
> blocks. It does not have asm-optimized csum_and_copy support, so it
> uses the generic C version.
> 
> What could be inferred in terms of prefetching and speculation might
> explain some differences between the approaches too.
> 
> I would be interested in any converging / diverging results testing the
> same combo with a different test code, because from my standpoint,
> things do not seem as obvious as they are supposed to be at the moment.
> 

If copy+csum is not using any recent memcopy optimizations, that is an
argument for at least equivalent performance.

But I don't get yet where the huge jittery should be coming from. Were
the measurement loop preemptible? In that case I would expect a split
copy followed by another loop to csum should give much worse results as
it needs the cache to stay warm - while copy-csum only touches the data
once.

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck()
  2021-04-16 17:12               ` Jan Kiszka
@ 2021-04-18  9:18                 ` Philippe Gerum
  2021-04-18 15:50                   ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2021-04-18  9:18 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai


Jan Kiszka <jan.kiszka@siemens.com> writes:

> On 16.04.21 18:48, Philippe Gerum wrote:
>> 
>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>> 
>>> On 15.04.21 09:54, Philippe Gerum wrote:
>>>>
>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>
>>>>> On 15.04.21 09:21, Philippe Gerum wrote:
>>>>>>
>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>>
>>>>>>> On 27.03.21 11:19, Philippe Gerum wrote:
>>>>>>>> From: Philippe Gerum <rpm@xenomai.org>
>>>>>>>>
>>>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
>>>>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing
>>>>>>>> a non-zero value would not even yield the proper result on some
>>>>>>>> architectures.
>>>>>>>>
>>>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>>>>>>>> to be used in the next computation, so let's wrap net_csum_copy() to
>>>>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for
>>>>>>>> later kernels so that we still feed csum_partial() with the user-given
>>>>>>>> csum. We still expect the x86, ARM and arm64 implementations of
>>>>>>>> csum_partial() to do the right thing.
>>>>>>>>
>>>>>>>
>>>>>>> If that issue only affects the ICMP code path, why not only changing
>>>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
>>>>>>> and csum in one step?
>>>>>>>
>>>>>>
>>>>>> As a result of #cc44c17baf7f3, I see no common helper available from the
>>>>>> kernel folding the copy and checksum operations anymore, so I see no way
>>>>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one
>>>>>> replacement for csum_partial_copy_nocheck() which would allow a csum
>>>>>> value to be fed in?
>>>>>>
>>>>>
>>>>> rtskb_copy_and_csum_dev does not need that.
>>>>>
>>>>
>>>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do
>>>> you want to deal with this?
>>>>
>>>
>>> That is an internal API, so we don't care.
>>>
>>> But even if we converted rtskb_copy_and_csum_bits to work as it used to
>>> (with a csum != 0), there would be not reason to make
>>> rtskb_copy_and_csum_dev pay that price.
>>>
>> 
>> Well, there may be a reason to challenge the idea that a folded
>> copy_and_csum is better for a real-time system than a split memcpy+csum
>> in the first place. Out of curiosity, I ran a few benchmarks lately on a
>> few SoCs regarding this, and it turned out that optimizing the data copy
>> to get the buffer quickly in place before checksumming the result may
>> well yield much better results with respect to jitter than what
>> csum_and_copy currently achieves on these SoCs.
>> 
>> Inline csum_and_copy did perform slightly better on average (a couple of
>> hundreds of nanosecs at best) but with pathological jittery in the worst
>> case at times. On the contrary, the split memcpy+csum method did not
>> exhibit such jittery during these tests, not once.
>> 
>> - four SoCs tested (2 x x86, armv7, armv8a)
>> - test code ran in kernel space (real-time task context,
>>   out-of-band/primary context)
>> - csum_partial_copy_nocheck() vs memcpy()+csum_partial()
>> - 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each
>> - all buffers (src & dst) aligned on L1_CACHE_BYTES
>> - each run performed 1000,000 iterations of a given checksum loop, no
>>   pause in between.
>> - no concurrent load on the board
>> - all results in nanoseconds
>> 
>> The worst results obtained are presented here for each SoC.
>> 
>> x86[1]
>> ------
>> 
>> vendor_id	: GenuineIntel
>> cpu family	: 6
>> model		: 92
>> model name	: Intel(R) Atom(TM) Processor E3940 @ 1.60GHz
>> stepping	: 9
>> cpu MHz		: 1593.600
>> cache size	: 1024 KB
>> cpuid level	: 21
>> wp		: yes
>> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts
>> vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs
>> 
>> ==
>> 
>> CSUM_COPY 32b: min=68, max=653, avg=70
>> CSUM_COPY 1024b: min=248, max=373, avg=251
>> CSUM_COPY 1500b: min=344, max=3123, avg=350   <=================
>> COPY+CSUM 32b: min=101, max=790, avg=103
>> COPY+CSUM 1024b: min=297, max=397, avg=300
>> COPY+CSUM 1500b: min=402, max=490, avg=405
>> 
>> ==
>> 
>> CSUM_COPY 32b: min=68, max=1420, avg=70
>> CSUM_COPY 1024b: min=248, max=29706, avg=251   <=================
>> CSUM_COPY 1500b: min=344, max=792, avg=350
>> COPY+CSUM 32b: min=101, max=872, avg=103
>> COPY+CSUM 1024b: min=297, max=776, avg=300
>> COPY+CSUM 1500b: min=402, max=853, avg=405
>> 
>> ==
>> 
>> CSUM_COPY 32b: min=68, max=661, avg=70
>> CSUM_COPY 1024b: min=248, max=1714, avg=251
>> CSUM_COPY 1500b: min=344, max=33937, avg=350   <=================
>> COPY+CSUM 32b: min=101, max=610, avg=103
>> COPY+CSUM 1024b: min=297, max=605, avg=300
>> COPY+CSUM 1500b: min=402, max=712, avg=405
>> 
>> x86[2]
>> ------
>> 
>> vendor_id       : GenuineIntel
>> cpu family      : 6
>> model           : 23
>> model name      : Intel(R) Core(TM)2 Duo CPU     E7200  @ 2.53GHz
>> stepping        : 6
>> microcode       : 0x60c
>> cpu MHz         : 1627.113
>> cache size      : 3072 KB
>> cpuid level     : 10
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm
>> 
>> CSUM_COPY 32b: min=558, max=31010, avg=674     <=================
>> CSUM_COPY 1024b: min=558, max=2794, avg=745
>> CSUM_COPY 1500b: min=558, max=2794, avg=841
>> COPY+CSUM 32b: min=558, max=2794, avg=671
>> COPY+CSUM 1024b: min=558, max=2794, avg=778
>> COPY+CSUM 1500b: min=838, max=2794, avg=865
>> 
>> ==
>> 
>> CSUM_COPY 32b: min=59, max=532, avg=62
>> CSUM_COPY 1024b: min=198, max=270, avg=201
>> CSUM_COPY 1500b: min=288, max=34921, avg=289   <=================
>> COPY+CSUM 32b: min=53, max=326, avg=56
>> COPY+CSUM 1024b: min=228, max=461, avg=232
>> COPY+CSUM 1500b: min=311, max=341, avg=317
>> 
>> ==
>> 
>> CSUM_COPY 32b: min=59, max=382, avg=62
>> CSUM_COPY 1024b: min=198, max=383, avg=201
>> CSUM_COPY 1500b: min=285, max=21235, avg=289   <=================
>> COPY+CSUM 32b: min=52, max=300, avg=56
>> COPY+CSUM 1024b: min=228, max=348, avg=232
>> COPY+CSUM 1500b: min=311, max=409, avg=317
>> 
>> Cortex A9 quad-core 1.2Ghz (iMX6qp)
>> -----------------------------------
>> 
>> model name	: ARMv7 Processor rev 10 (v7l)
>> Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
>> CPU implementer	: 0x41
>> CPU architecture: 7
>> CPU variant	: 0x2
>> CPU part	: 0xc09
>> CPU revision	: 10
>> 
>> CSUM_COPY 32b: min=333, max=1334, avg=440
>> CSUM_COPY 1024b: min=1000, max=2666, avg=1060
>> CSUM_COPY 1500b: min=1333, max=45333, avg=1357   <=================
>> COPY+CSUM 32b: min=333, max=1334, avg=476
>> COPY+CSUM 1024b: min=1000, max=2333, avg=1324
>> COPY+CSUM 1500b: min=1666, max=2334, avg=1713
>> 
>> ==
>> 
>> CSUM_COPY 32b: min=333, max=1334, avg=439
>> CSUM_COPY 1024b: min=1000, max=46000, avg=1060   <=================
>> CSUM_COPY 1500b: min=1333, max=5000, avg=1351
>> COPY+CSUM 32b: min=333, max=1334, avg=476
>> COPY+CSUM 1024b: min=1000, max=2334, avg=1324
>> COPY+CSUM 1500b: min=1666, max=2667, avg=1713
>> 
>> ==
>> 
>> CSUM_COPY 32b: min=333, max=1666, avg=454
>> CSUM_COPY 1024b: min=1000, max=2000, avg=1060
>> CSUM_COPY 1500b: min=1333, max=45000, avg=1348   <=================
>> COPY+CSUM 32b: min=333, max=1334, avg=454
>> COPY+CSUM 1024b: min=1000, max=2334, avg=1317
>> COPY+CSUM 1500b: min=1666, max=6000, avg=1712
>> 
>> Cortex A55 quad-core 2Ghz (Odroid C4)
>> -------------------------------------
>> 
>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
>> CPU implementer : 0x41
>> CPU architecture: 8
>> CPU variant     : 0x1
>> CPU part        : 0xd05
>> CPU revision    : 0
>> 
>> 
>> CSUM_COPY 32b: min=125, max=833, avg=140
>> CSUM_COPY 1024b: min=625, max=41916, avg=673   <=================
>> CSUM_COPY 1500b: min=875, max=3875, avg=923
>> COPY+CSUM 32b: min=125, max=458, avg=140
>> COPY+CSUM 1024b: min=625, max=1166, avg=666
>> COPY+CSUM 1500b: min=875, max=1167, avg=913
>> 
>> ==
>> 
>> CSUM_COPY 32b: min=125, max=1292, avg=139
>> CSUM_COPY 1024b: min=541, max=48333, avg=555
>> CSUM_COPY 1500b: min=708, max=3458, avg=740
>> COPY+CSUM 32b: min=125, max=292, avg=136
>> COPY+CSUM 1024b: min=541, max=750, avg=556
>> COPY+CSUM 1500b: min=708, max=834, avg=740
>> 
>> ==
>> 
>> CSUM_COPY 32b: min=125, max=833, avg=140
>> CSUM_COPY 1024b: min=666, max=55667, avg=673   <=================
>> CSUM_COPY 1500b: min=875, max=4208, avg=913
>> COPY+CSUM 32b: min=125, max=375, avg=140
>> COPY+CSUM 1024b: min=666, max=916, avg=673
>> COPY+CSUM 1500b: min=875, max=1042, avg=913
>> 
>> ============
>> 
>> A few additional observations from looking at the implementation:
>> 
>> For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete
>> buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb
>> which is faster.
>> 
>> arm32/armv7 optimizes memcpy by loading up to 8 words in a single
>> instruction. csum_and_copy loads/stores at best 4 words at a time,
>> only when src and dst are 32bit aligned (which matches the test case).
>> 
>> arm64/armv8a uses load/store pair instructions to copy memory
>> blocks. It does not have asm-optimized csum_and_copy support, so it
>> uses the generic C version.
>> 
>> What could be inferred in terms of prefetching and speculation might
>> explain some differences between the approaches too.
>> 
>> I would be interested in any converging / diverging results testing the
>> same combo with a different test code, because from my standpoint,
>> things do not seem as obvious as they are supposed to be at the moment.
>> 
>
> If copy+csum is not using any recent memcopy optimizations, that is an
> argument for at least equivalent performance.
>

You mean the folded version, i.e. copy_and_csum? If so, I can't see any
way for that one to optimize via fast string operations.

> But I don't get yet where the huge jittery should be coming from. Were
> the measurement loop preemptible? In that case I would expect a split

Out of band stage, so only preemptible by Xenomai timer ticks, which
means only the host tick emulation at this point since there was no
outstanding Xenomai timers started yet when running the loops. Pretty
slim chance to see these latency spots consistently reproduced, and only
for the folded copy_sum version.

> copy followed by another loop to csum should give much worse results as
> it needs the cache to stay warm - while copy-csum only touches the data
> once.
>

Conversely, if the copy is much faster, the odds of being preempted may
increase, yielding better results overall. This said, I'm unsure this is
related to preemption anyway, this looks like the fingerprints of minor
faults with PTEs. Why this would only happen in the folded version is
still a mystery to me at the moment.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck()
  2021-04-18  9:18                 ` Philippe Gerum
@ 2021-04-18 15:50                   ` Philippe Gerum
  2021-05-04 14:48                     ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2021-04-18 15:50 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai


Philippe Gerum <rpm@xenomai.org> writes:

> Jan Kiszka <jan.kiszka@siemens.com> writes:
>
>> On 16.04.21 18:48, Philippe Gerum wrote:
>>> 
>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>> 
>>>> On 15.04.21 09:54, Philippe Gerum wrote:
>>>>>
>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>
>>>>>> On 15.04.21 09:21, Philippe Gerum wrote:
>>>>>>>
>>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>>>
>>>>>>>> On 27.03.21 11:19, Philippe Gerum wrote:
>>>>>>>>> From: Philippe Gerum <rpm@xenomai.org>
>>>>>>>>>
>>>>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
>>>>>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing
>>>>>>>>> a non-zero value would not even yield the proper result on some
>>>>>>>>> architectures.
>>>>>>>>>
>>>>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>>>>>>>>> to be used in the next computation, so let's wrap net_csum_copy() to
>>>>>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for
>>>>>>>>> later kernels so that we still feed csum_partial() with the user-given
>>>>>>>>> csum. We still expect the x86, ARM and arm64 implementations of
>>>>>>>>> csum_partial() to do the right thing.
>>>>>>>>>
>>>>>>>>
>>>>>>>> If that issue only affects the ICMP code path, why not only changing
>>>>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
>>>>>>>> and csum in one step?
>>>>>>>>
>>>>>>>
>>>>>>> As a result of #cc44c17baf7f3, I see no common helper available from the
>>>>>>> kernel folding the copy and checksum operations anymore, so I see no way
>>>>>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one
>>>>>>> replacement for csum_partial_copy_nocheck() which would allow a csum
>>>>>>> value to be fed in?
>>>>>>>
>>>>>>
>>>>>> rtskb_copy_and_csum_dev does not need that.
>>>>>>
>>>>>
>>>>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do
>>>>> you want to deal with this?
>>>>>
>>>>
>>>> That is an internal API, so we don't care.
>>>>
>>>> But even if we converted rtskb_copy_and_csum_bits to work as it used to
>>>> (with a csum != 0), there would be not reason to make
>>>> rtskb_copy_and_csum_dev pay that price.
>>>>
>>> 
>>> Well, there may be a reason to challenge the idea that a folded
>>> copy_and_csum is better for a real-time system than a split memcpy+csum
>>> in the first place. Out of curiosity, I ran a few benchmarks lately on a
>>> few SoCs regarding this, and it turned out that optimizing the data copy
>>> to get the buffer quickly in place before checksumming the result may
>>> well yield much better results with respect to jitter than what
>>> csum_and_copy currently achieves on these SoCs.
>>> 
>>> Inline csum_and_copy did perform slightly better on average (a couple of
>>> hundreds of nanosecs at best) but with pathological jittery in the worst
>>> case at times. On the contrary, the split memcpy+csum method did not
>>> exhibit such jittery during these tests, not once.
>>> 
>>> - four SoCs tested (2 x x86, armv7, armv8a)
>>> - test code ran in kernel space (real-time task context,
>>>   out-of-band/primary context)
>>> - csum_partial_copy_nocheck() vs memcpy()+csum_partial()
>>> - 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each
>>> - all buffers (src & dst) aligned on L1_CACHE_BYTES
>>> - each run performed 1000,000 iterations of a given checksum loop, no
>>>   pause in between.
>>> - no concurrent load on the board
>>> - all results in nanoseconds
>>> 
>>> The worst results obtained are presented here for each SoC.
>>> 
>>> x86[1]
>>> ------
>>> 
>>> vendor_id	: GenuineIntel
>>> cpu family	: 6
>>> model		: 92
>>> model name	: Intel(R) Atom(TM) Processor E3940 @ 1.60GHz
>>> stepping	: 9
>>> cpu MHz		: 1593.600
>>> cache size	: 1024 KB
>>> cpuid level	: 21
>>> wp		: yes
>>> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts
>>> vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs
>>> 
>>> ==
>>> 
>>> CSUM_COPY 32b: min=68, max=653, avg=70
>>> CSUM_COPY 1024b: min=248, max=373, avg=251
>>> CSUM_COPY 1500b: min=344, max=3123, avg=350   <=================
>>> COPY+CSUM 32b: min=101, max=790, avg=103
>>> COPY+CSUM 1024b: min=297, max=397, avg=300
>>> COPY+CSUM 1500b: min=402, max=490, avg=405
>>> 
>>> ==
>>> 
>>> CSUM_COPY 32b: min=68, max=1420, avg=70
>>> CSUM_COPY 1024b: min=248, max=29706, avg=251   <=================
>>> CSUM_COPY 1500b: min=344, max=792, avg=350
>>> COPY+CSUM 32b: min=101, max=872, avg=103
>>> COPY+CSUM 1024b: min=297, max=776, avg=300
>>> COPY+CSUM 1500b: min=402, max=853, avg=405
>>> 
>>> ==
>>> 
>>> CSUM_COPY 32b: min=68, max=661, avg=70
>>> CSUM_COPY 1024b: min=248, max=1714, avg=251
>>> CSUM_COPY 1500b: min=344, max=33937, avg=350   <=================
>>> COPY+CSUM 32b: min=101, max=610, avg=103
>>> COPY+CSUM 1024b: min=297, max=605, avg=300
>>> COPY+CSUM 1500b: min=402, max=712, avg=405
>>> 
>>> x86[2]
>>> ------
>>> 
>>> vendor_id       : GenuineIntel
>>> cpu family      : 6
>>> model           : 23
>>> model name      : Intel(R) Core(TM)2 Duo CPU     E7200  @ 2.53GHz
>>> stepping        : 6
>>> microcode       : 0x60c
>>> cpu MHz         : 1627.113
>>> cache size      : 3072 KB
>>> cpuid level     : 10
>>> wp              : yes
>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm
>>> 
>>> CSUM_COPY 32b: min=558, max=31010, avg=674     <=================
>>> CSUM_COPY 1024b: min=558, max=2794, avg=745
>>> CSUM_COPY 1500b: min=558, max=2794, avg=841
>>> COPY+CSUM 32b: min=558, max=2794, avg=671
>>> COPY+CSUM 1024b: min=558, max=2794, avg=778
>>> COPY+CSUM 1500b: min=838, max=2794, avg=865
>>> 
>>> ==
>>> 
>>> CSUM_COPY 32b: min=59, max=532, avg=62
>>> CSUM_COPY 1024b: min=198, max=270, avg=201
>>> CSUM_COPY 1500b: min=288, max=34921, avg=289   <=================
>>> COPY+CSUM 32b: min=53, max=326, avg=56
>>> COPY+CSUM 1024b: min=228, max=461, avg=232
>>> COPY+CSUM 1500b: min=311, max=341, avg=317
>>> 
>>> ==
>>> 
>>> CSUM_COPY 32b: min=59, max=382, avg=62
>>> CSUM_COPY 1024b: min=198, max=383, avg=201
>>> CSUM_COPY 1500b: min=285, max=21235, avg=289   <=================
>>> COPY+CSUM 32b: min=52, max=300, avg=56
>>> COPY+CSUM 1024b: min=228, max=348, avg=232
>>> COPY+CSUM 1500b: min=311, max=409, avg=317
>>> 
>>> Cortex A9 quad-core 1.2Ghz (iMX6qp)
>>> -----------------------------------
>>> 
>>> model name	: ARMv7 Processor rev 10 (v7l)
>>> Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
>>> CPU implementer	: 0x41
>>> CPU architecture: 7
>>> CPU variant	: 0x2
>>> CPU part	: 0xc09
>>> CPU revision	: 10
>>> 
>>> CSUM_COPY 32b: min=333, max=1334, avg=440
>>> CSUM_COPY 1024b: min=1000, max=2666, avg=1060
>>> CSUM_COPY 1500b: min=1333, max=45333, avg=1357   <=================
>>> COPY+CSUM 32b: min=333, max=1334, avg=476
>>> COPY+CSUM 1024b: min=1000, max=2333, avg=1324
>>> COPY+CSUM 1500b: min=1666, max=2334, avg=1713
>>> 
>>> ==
>>> 
>>> CSUM_COPY 32b: min=333, max=1334, avg=439
>>> CSUM_COPY 1024b: min=1000, max=46000, avg=1060   <=================
>>> CSUM_COPY 1500b: min=1333, max=5000, avg=1351
>>> COPY+CSUM 32b: min=333, max=1334, avg=476
>>> COPY+CSUM 1024b: min=1000, max=2334, avg=1324
>>> COPY+CSUM 1500b: min=1666, max=2667, avg=1713
>>> 
>>> ==
>>> 
>>> CSUM_COPY 32b: min=333, max=1666, avg=454
>>> CSUM_COPY 1024b: min=1000, max=2000, avg=1060
>>> CSUM_COPY 1500b: min=1333, max=45000, avg=1348   <=================
>>> COPY+CSUM 32b: min=333, max=1334, avg=454
>>> COPY+CSUM 1024b: min=1000, max=2334, avg=1317
>>> COPY+CSUM 1500b: min=1666, max=6000, avg=1712
>>> 
>>> Cortex A55 quad-core 2Ghz (Odroid C4)
>>> -------------------------------------
>>> 
>>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
>>> CPU implementer : 0x41
>>> CPU architecture: 8
>>> CPU variant     : 0x1
>>> CPU part        : 0xd05
>>> CPU revision    : 0
>>> 
>>> 
>>> CSUM_COPY 32b: min=125, max=833, avg=140
>>> CSUM_COPY 1024b: min=625, max=41916, avg=673   <=================
>>> CSUM_COPY 1500b: min=875, max=3875, avg=923
>>> COPY+CSUM 32b: min=125, max=458, avg=140
>>> COPY+CSUM 1024b: min=625, max=1166, avg=666
>>> COPY+CSUM 1500b: min=875, max=1167, avg=913
>>> 
>>> ==
>>> 
>>> CSUM_COPY 32b: min=125, max=1292, avg=139
>>> CSUM_COPY 1024b: min=541, max=48333, avg=555
>>> CSUM_COPY 1500b: min=708, max=3458, avg=740
>>> COPY+CSUM 32b: min=125, max=292, avg=136
>>> COPY+CSUM 1024b: min=541, max=750, avg=556
>>> COPY+CSUM 1500b: min=708, max=834, avg=740
>>> 
>>> ==
>>> 
>>> CSUM_COPY 32b: min=125, max=833, avg=140
>>> CSUM_COPY 1024b: min=666, max=55667, avg=673   <=================
>>> CSUM_COPY 1500b: min=875, max=4208, avg=913
>>> COPY+CSUM 32b: min=125, max=375, avg=140
>>> COPY+CSUM 1024b: min=666, max=916, avg=673
>>> COPY+CSUM 1500b: min=875, max=1042, avg=913
>>> 
>>> ============
>>> 
>>> A few additional observations from looking at the implementation:
>>> 
>>> For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete
>>> buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb
>>> which is faster.
>>> 
>>> arm32/armv7 optimizes memcpy by loading up to 8 words in a single
>>> instruction. csum_and_copy loads/stores at best 4 words at a time,
>>> only when src and dst are 32bit aligned (which matches the test case).
>>> 
>>> arm64/armv8a uses load/store pair instructions to copy memory
>>> blocks. It does not have asm-optimized csum_and_copy support, so it
>>> uses the generic C version.
>>> 
>>> What could be inferred in terms of prefetching and speculation might
>>> explain some differences between the approaches too.
>>> 
>>> I would be interested in any converging / diverging results testing the
>>> same combo with a different test code, because from my standpoint,
>>> things do not seem as obvious as they are supposed to be at the moment.
>>> 
>>
>> If copy+csum is not using any recent memcopy optimizations, that is an
>> argument for at least equivalent performance.
>>
>
> You mean the folded version, i.e. copy_and_csum? If so, I can't see any
> way for that one to optimize via fast string operations.
>
>> But I don't get yet where the huge jittery should be coming from. Were
>> the measurement loop preemptible? In that case I would expect a split
>
> Out of band stage, so only preemptible by Xenomai timer ticks, which
> means only the host tick emulation at this point since there was no
> outstanding Xenomai timers started yet when running the loops. Pretty
> slim chance to see these latency spots consistently reproduced, and only
> for the folded copy_sum version.
>
>> copy followed by another loop to csum should give much worse results as
>> it needs the cache to stay warm - while copy-csum only touches the data
>> once.
>>
>
> Conversely, if the copy is much faster, the odds of being preempted may
> increase, yielding better results overall.

False alarm. Preemption was the issue, by the top half of the host tick
handling in primary mode. The latest clock event scheduled by the kernel
managed to enter the pipeline at a random time, but always within the
execution window of the all-in-one csum_and_copy code. Although this
event was deferred and not immediately passed to the in-band context,
the time spent dealing with it was enough to show up in the results.

> This said, I'm unsure this is
> related to preemption anyway, this looks like the fingerprints of minor
> faults with PTEs. Why this would only happen in the folded version is
> still a mystery to me at the moment.

It did not actually, no minor faults.

The results are now consistent, both implementations are comparable
performance-wise as the optimized memcpy tends to offset the advantage
of calculating the checksum on the fly, saving a read access. armv8
benefits more from the former, since it does not have an optimized
csum_and_copy but uses the generic C version instead.

== x86[1]

CSUM_COPY 32b: min=68, max=640, avg=70
CSUM_COPY 1024b: min=247, max=773, avg=252
CSUM_COPY 1500b: min=343, max=832, avg=350
COPY+CSUM 32b: min=100, max=651, avg=131
COPY+CSUM 1024b: min=296, max=752, avg=298
COPY+CSUM 1500b: min=397, max=845, avg=400

== x86[2]

CSUM_COPY 32b: min=63, max=267, avg=66
CSUM_COPY 1024b: min=198, max=300, avg=201
CSUM_COPY 1500b: min=288, max=611, avg=291
COPY+CSUM 32b: min=56, max=360, avg=56
COPY+CSUM 1024b: min=228, max=420, avg=231
COPY+CSUM 1500b: min=307, max=337, avg=318

== armv7 (imx6qp)

CSUM_COPY 32b: min=333, max=1334, avg=439
CSUM_COPY 1024b: min=1000, max=2000, avg=1045
CSUM_COPY 1500b: min=1000, max=2334, avg=1325
COPY+CSUM 32b: min=333, max=1334, avg=454
COPY+CSUM 1024b: min=1333, max=2334, avg=1347
COPY+CSUM 1500b: min=1666, max=2667, avg=1734

== armv8a (C4)

CSUM_COPY 32b: min=125, max=792, avg=130
CSUM_COPY 1024b: min=500, max=1125, avg=550
CSUM_COPY 1500b: min=708, max=1833, avg=726
COPY+CSUM 32b: min=125, max=292, avg=130
COPY+CSUM 1024b: min=541, max=708, avg=550
COPY+CSUM 1500b: min=708, max=875, avg=730

-- 
Philippe.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck()
  2021-04-18 15:50                   ` Philippe Gerum
@ 2021-05-04 14:48                     ` Philippe Gerum
  2021-05-05  5:43                       ` Jan Kiszka
  0 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2021-05-04 14:48 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai


Philippe Gerum <rpm@xenomai.org> writes:

> Philippe Gerum <rpm@xenomai.org> writes:
>
>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>
>>> On 16.04.21 18:48, Philippe Gerum wrote:
>>>> 
>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>> 
>>>>> On 15.04.21 09:54, Philippe Gerum wrote:
>>>>>>
>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>>
>>>>>>> On 15.04.21 09:21, Philippe Gerum wrote:
>>>>>>>>
>>>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>>>>
>>>>>>>>> On 27.03.21 11:19, Philippe Gerum wrote:
>>>>>>>>>> From: Philippe Gerum <rpm@xenomai.org>
>>>>>>>>>>
>>>>>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
>>>>>>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing
>>>>>>>>>> a non-zero value would not even yield the proper result on some
>>>>>>>>>> architectures.
>>>>>>>>>>
>>>>>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>>>>>>>>>> to be used in the next computation, so let's wrap net_csum_copy() to
>>>>>>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for
>>>>>>>>>> later kernels so that we still feed csum_partial() with the user-given
>>>>>>>>>> csum. We still expect the x86, ARM and arm64 implementations of
>>>>>>>>>> csum_partial() to do the right thing.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If that issue only affects the ICMP code path, why not only changing
>>>>>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
>>>>>>>>> and csum in one step?
>>>>>>>>>
>>>>>>>>
>>>>>>>> As a result of #cc44c17baf7f3, I see no common helper available from the
>>>>>>>> kernel folding the copy and checksum operations anymore, so I see no way
>>>>>>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one
>>>>>>>> replacement for csum_partial_copy_nocheck() which would allow a csum
>>>>>>>> value to be fed in?
>>>>>>>>
>>>>>>>
>>>>>>> rtskb_copy_and_csum_dev does not need that.
>>>>>>>
>>>>>>
>>>>>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do
>>>>>> you want to deal with this?
>>>>>>
>>>>>
>>>>> That is an internal API, so we don't care.
>>>>>
>>>>> But even if we converted rtskb_copy_and_csum_bits to work as it used to
>>>>> (with a csum != 0), there would be not reason to make
>>>>> rtskb_copy_and_csum_dev pay that price.
>>>>>
>>>> 
>>>> Well, there may be a reason to challenge the idea that a folded
>>>> copy_and_csum is better for a real-time system than a split memcpy+csum
>>>> in the first place. Out of curiosity, I ran a few benchmarks lately on a
>>>> few SoCs regarding this, and it turned out that optimizing the data copy
>>>> to get the buffer quickly in place before checksumming the result may
>>>> well yield much better results with respect to jitter than what
>>>> csum_and_copy currently achieves on these SoCs.
>>>> 
>>>> Inline csum_and_copy did perform slightly better on average (a couple of
>>>> hundreds of nanosecs at best) but with pathological jittery in the worst
>>>> case at times. On the contrary, the split memcpy+csum method did not
>>>> exhibit such jittery during these tests, not once.
>>>> 
>>>> - four SoCs tested (2 x x86, armv7, armv8a)
>>>> - test code ran in kernel space (real-time task context,
>>>>   out-of-band/primary context)
>>>> - csum_partial_copy_nocheck() vs memcpy()+csum_partial()
>>>> - 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each
>>>> - all buffers (src & dst) aligned on L1_CACHE_BYTES
>>>> - each run performed 1000,000 iterations of a given checksum loop, no
>>>>   pause in between.
>>>> - no concurrent load on the board
>>>> - all results in nanoseconds
>>>> 
>>>> The worst results obtained are presented here for each SoC.
>>>> 
>>>> x86[1]
>>>> ------
>>>> 
>>>> vendor_id	: GenuineIntel
>>>> cpu family	: 6
>>>> model		: 92
>>>> model name	: Intel(R) Atom(TM) Processor E3940 @ 1.60GHz
>>>> stepping	: 9
>>>> cpu MHz		: 1593.600
>>>> cache size	: 1024 KB
>>>> cpuid level	: 21
>>>> wp		: yes
>>>> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts
>>>> vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=68, max=653, avg=70
>>>> CSUM_COPY 1024b: min=248, max=373, avg=251
>>>> CSUM_COPY 1500b: min=344, max=3123, avg=350   <=================
>>>> COPY+CSUM 32b: min=101, max=790, avg=103
>>>> COPY+CSUM 1024b: min=297, max=397, avg=300
>>>> COPY+CSUM 1500b: min=402, max=490, avg=405
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=68, max=1420, avg=70
>>>> CSUM_COPY 1024b: min=248, max=29706, avg=251   <=================
>>>> CSUM_COPY 1500b: min=344, max=792, avg=350
>>>> COPY+CSUM 32b: min=101, max=872, avg=103
>>>> COPY+CSUM 1024b: min=297, max=776, avg=300
>>>> COPY+CSUM 1500b: min=402, max=853, avg=405
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=68, max=661, avg=70
>>>> CSUM_COPY 1024b: min=248, max=1714, avg=251
>>>> CSUM_COPY 1500b: min=344, max=33937, avg=350   <=================
>>>> COPY+CSUM 32b: min=101, max=610, avg=103
>>>> COPY+CSUM 1024b: min=297, max=605, avg=300
>>>> COPY+CSUM 1500b: min=402, max=712, avg=405
>>>> 
>>>> x86[2]
>>>> ------
>>>> 
>>>> vendor_id       : GenuineIntel
>>>> cpu family      : 6
>>>> model           : 23
>>>> model name      : Intel(R) Core(TM)2 Duo CPU     E7200  @ 2.53GHz
>>>> stepping        : 6
>>>> microcode       : 0x60c
>>>> cpu MHz         : 1627.113
>>>> cache size      : 3072 KB
>>>> cpuid level     : 10
>>>> wp              : yes
>>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm
>>>> 
>>>> CSUM_COPY 32b: min=558, max=31010, avg=674     <=================
>>>> CSUM_COPY 1024b: min=558, max=2794, avg=745
>>>> CSUM_COPY 1500b: min=558, max=2794, avg=841
>>>> COPY+CSUM 32b: min=558, max=2794, avg=671
>>>> COPY+CSUM 1024b: min=558, max=2794, avg=778
>>>> COPY+CSUM 1500b: min=838, max=2794, avg=865
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=59, max=532, avg=62
>>>> CSUM_COPY 1024b: min=198, max=270, avg=201
>>>> CSUM_COPY 1500b: min=288, max=34921, avg=289   <=================
>>>> COPY+CSUM 32b: min=53, max=326, avg=56
>>>> COPY+CSUM 1024b: min=228, max=461, avg=232
>>>> COPY+CSUM 1500b: min=311, max=341, avg=317
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=59, max=382, avg=62
>>>> CSUM_COPY 1024b: min=198, max=383, avg=201
>>>> CSUM_COPY 1500b: min=285, max=21235, avg=289   <=================
>>>> COPY+CSUM 32b: min=52, max=300, avg=56
>>>> COPY+CSUM 1024b: min=228, max=348, avg=232
>>>> COPY+CSUM 1500b: min=311, max=409, avg=317
>>>> 
>>>> Cortex A9 quad-core 1.2Ghz (iMX6qp)
>>>> -----------------------------------
>>>> 
>>>> model name	: ARMv7 Processor rev 10 (v7l)
>>>> Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
>>>> CPU implementer	: 0x41
>>>> CPU architecture: 7
>>>> CPU variant	: 0x2
>>>> CPU part	: 0xc09
>>>> CPU revision	: 10
>>>> 
>>>> CSUM_COPY 32b: min=333, max=1334, avg=440
>>>> CSUM_COPY 1024b: min=1000, max=2666, avg=1060
>>>> CSUM_COPY 1500b: min=1333, max=45333, avg=1357   <=================
>>>> COPY+CSUM 32b: min=333, max=1334, avg=476
>>>> COPY+CSUM 1024b: min=1000, max=2333, avg=1324
>>>> COPY+CSUM 1500b: min=1666, max=2334, avg=1713
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=333, max=1334, avg=439
>>>> CSUM_COPY 1024b: min=1000, max=46000, avg=1060   <=================
>>>> CSUM_COPY 1500b: min=1333, max=5000, avg=1351
>>>> COPY+CSUM 32b: min=333, max=1334, avg=476
>>>> COPY+CSUM 1024b: min=1000, max=2334, avg=1324
>>>> COPY+CSUM 1500b: min=1666, max=2667, avg=1713
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=333, max=1666, avg=454
>>>> CSUM_COPY 1024b: min=1000, max=2000, avg=1060
>>>> CSUM_COPY 1500b: min=1333, max=45000, avg=1348   <=================
>>>> COPY+CSUM 32b: min=333, max=1334, avg=454
>>>> COPY+CSUM 1024b: min=1000, max=2334, avg=1317
>>>> COPY+CSUM 1500b: min=1666, max=6000, avg=1712
>>>> 
>>>> Cortex A55 quad-core 2Ghz (Odroid C4)
>>>> -------------------------------------
>>>> 
>>>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
>>>> CPU implementer : 0x41
>>>> CPU architecture: 8
>>>> CPU variant     : 0x1
>>>> CPU part        : 0xd05
>>>> CPU revision    : 0
>>>> 
>>>> 
>>>> CSUM_COPY 32b: min=125, max=833, avg=140
>>>> CSUM_COPY 1024b: min=625, max=41916, avg=673   <=================
>>>> CSUM_COPY 1500b: min=875, max=3875, avg=923
>>>> COPY+CSUM 32b: min=125, max=458, avg=140
>>>> COPY+CSUM 1024b: min=625, max=1166, avg=666
>>>> COPY+CSUM 1500b: min=875, max=1167, avg=913
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=125, max=1292, avg=139
>>>> CSUM_COPY 1024b: min=541, max=48333, avg=555
>>>> CSUM_COPY 1500b: min=708, max=3458, avg=740
>>>> COPY+CSUM 32b: min=125, max=292, avg=136
>>>> COPY+CSUM 1024b: min=541, max=750, avg=556
>>>> COPY+CSUM 1500b: min=708, max=834, avg=740
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=125, max=833, avg=140
>>>> CSUM_COPY 1024b: min=666, max=55667, avg=673   <=================
>>>> CSUM_COPY 1500b: min=875, max=4208, avg=913
>>>> COPY+CSUM 32b: min=125, max=375, avg=140
>>>> COPY+CSUM 1024b: min=666, max=916, avg=673
>>>> COPY+CSUM 1500b: min=875, max=1042, avg=913
>>>> 
>>>> ============
>>>> 
>>>> A few additional observations from looking at the implementation:
>>>> 
>>>> For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete
>>>> buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb
>>>> which is faster.
>>>> 
>>>> arm32/armv7 optimizes memcpy by loading up to 8 words in a single
>>>> instruction. csum_and_copy loads/stores at best 4 words at a time,
>>>> only when src and dst are 32bit aligned (which matches the test case).
>>>> 
>>>> arm64/armv8a uses load/store pair instructions to copy memory
>>>> blocks. It does not have asm-optimized csum_and_copy support, so it
>>>> uses the generic C version.
>>>> 
>>>> What could be inferred in terms of prefetching and speculation might
>>>> explain some differences between the approaches too.
>>>> 
>>>> I would be interested in any converging / diverging results testing the
>>>> same combo with a different test code, because from my standpoint,
>>>> things do not seem as obvious as they are supposed to be at the moment.
>>>> 
>>>
>>> If copy+csum is not using any recent memcopy optimizations, that is an
>>> argument for at least equivalent performance.
>>>
>>
>> You mean the folded version, i.e. copy_and_csum? If so, I can't see any
>> way for that one to optimize via fast string operations.
>>
>>> But I don't get yet where the huge jittery should be coming from. Were
>>> the measurement loop preemptible? In that case I would expect a split
>>
>> Out of band stage, so only preemptible by Xenomai timer ticks, which
>> means only the host tick emulation at this point since there was no
>> outstanding Xenomai timers started yet when running the loops. Pretty
>> slim chance to see these latency spots consistently reproduced, and only
>> for the folded copy_sum version.
>>
>>> copy followed by another loop to csum should give much worse results as
>>> it needs the cache to stay warm - while copy-csum only touches the data
>>> once.
>>>
>>
>> Conversely, if the copy is much faster, the odds of being preempted may
>> increase, yielding better results overall.
>
> False alarm. Preemption was the issue, by the top half of the host tick
> handling in primary mode. The latest clock event scheduled by the kernel
> managed to enter the pipeline at a random time, but always within the
> execution window of the all-in-one csum_and_copy code. Although this
> event was deferred and not immediately passed to the in-band context,
> the time spent dealing with it was enough to show up in the results.
>
>> This said, I'm unsure this is
>> related to preemption anyway, this looks like the fingerprints of minor
>> faults with PTEs. Why this would only happen in the folded version is
>> still a mystery to me at the moment.
>
> It did not actually, no minor faults.
>
> The results are now consistent, both implementations are comparable
> performance-wise as the optimized memcpy tends to offset the advantage
> of calculating the checksum on the fly, saving a read access. armv8
> benefits more from the former, since it does not have an optimized
> csum_and_copy but uses the generic C version instead.
>
> == x86[1]
>
> CSUM_COPY 32b: min=68, max=640, avg=70
> CSUM_COPY 1024b: min=247, max=773, avg=252
> CSUM_COPY 1500b: min=343, max=832, avg=350
> COPY+CSUM 32b: min=100, max=651, avg=131
> COPY+CSUM 1024b: min=296, max=752, avg=298
> COPY+CSUM 1500b: min=397, max=845, avg=400
>
> == x86[2]
>
> CSUM_COPY 32b: min=63, max=267, avg=66
> CSUM_COPY 1024b: min=198, max=300, avg=201
> CSUM_COPY 1500b: min=288, max=611, avg=291
> COPY+CSUM 32b: min=56, max=360, avg=56
> COPY+CSUM 1024b: min=228, max=420, avg=231
> COPY+CSUM 1500b: min=307, max=337, avg=318
>
> == armv7 (imx6qp)
>
> CSUM_COPY 32b: min=333, max=1334, avg=439
> CSUM_COPY 1024b: min=1000, max=2000, avg=1045
> CSUM_COPY 1500b: min=1000, max=2334, avg=1325
> COPY+CSUM 32b: min=333, max=1334, avg=454
> COPY+CSUM 1024b: min=1333, max=2334, avg=1347
> COPY+CSUM 1500b: min=1666, max=2667, avg=1734
>
> == armv8a (C4)
>
> CSUM_COPY 32b: min=125, max=792, avg=130
> CSUM_COPY 1024b: min=500, max=1125, avg=550
> CSUM_COPY 1500b: min=708, max=1833, avg=726
> COPY+CSUM 32b: min=125, max=292, avg=130
> COPY+CSUM 1024b: min=541, max=708, avg=550
> COPY+CSUM 1500b: min=708, max=875, avg=730

Last round of results about this issue, now measuring the csum_copy vs
csum+copy performances in idle vs busy scenarios. Busy means
hackbench+dd loop streaming 128M in the background from zero -> null, in
order to badly trash the D-caches while the test runs. All figures in
nanosecs.

iMX6QP (Cortex A9)
------------------

=== idle

CSUM_COPY 32b: min=333, max=1333, avg=439
CSUM_COPY 1024b: min=1000, max=2000, avg=1045
CSUM_COPY 1500b: min=1333, max=2000, avg=1333
COPY+CSUM 32b: min=333, max=1333, avg=443
COPY+CSUM 1024b: min=1000, max=2334, avg=1345
COPY+CSUM 1500b: min=1666, max=2667, avg=1737

=== busy

CSUM_COPY 32b: min=333, max=4333, avg=466
CSUM_COPY 1024b: min=1000, max=5000, avg=1088
CSUM_COPY 1500b: min=1333, max=5667, avg=1393
COPY+CSUM 32b: min=333, max=1334, avg=454
COPY+CSUM 1024b: min=1000, max=2000, avg=1341
COPY+CSUM 1500b: min=1666, max=2666, avg=1745

C4 (Cortex A55)
---------------

=== idle

CSUM_COPY 32b: min=125, max=791, avg=130
CSUM_COPY 1024b: min=541, max=834, avg=550
CSUM_COPY 1500b: min=708, max=1875, avg=740
COPY+CSUM 32b: min=125, max=167, avg=133
COPY+CSUM 1024b: min=541, max=625, avg=553
COPY+CSUM 1500b: min=708, max=750, avg=730

=== busy

CSUM_COPY 32b: min=125, max=792, avg=133
CSUM_COPY 1024b: min=500, max=2000, avg=552
CSUM_COPY 1500b: min=708, max=1542, avg=744
COPY+CSUM 32b: min=125, max=375, avg=133
COPY+CSUM 1024b: min=500, max=709, avg=553
COPY+CSUM 1500b: min=708, max=916, avg=743

x86 (atom x5)
-------------

=== idle

CSUM_COPY 32b: min=67, max=590, avg=70
CSUM_COPY 1024b: min=245, max=385, avg=251
CSUM_COPY 1500b: min=343, max=521, avg=350
COPY+CSUM 32b: min=101, max=679, avg=117
COPY+CSUM 1024b: min=296, max=379, avg=298
COPY+CSUM 1500b: min=399, max=502, avg=404

== busy

CSUM_COPY 32b: min=65, max=709, avg=71
CSUM_COPY 1024b: min=243, max=702, avg=252
CSUM_COPY 1500b: min=340, max=1055, avg=351
COPY+CSUM 32b: min=100, max=665, avg=120
COPY+CSUM 1024b: min=295, max=669, avg=298
COPY+CSUM 1500b: min=399, max=686, avg=403

As expected from the code, arm64 which has no folded csum_copy
implementation makes the best of using the split copy+csum path. All
architectures seem to benefit from optimized memcpy under load when it
comes to the worst case execution time. x86 is less prone to jittery
under cache trashing than others as usual, but even there, the
max. figures for csum+copy in busy context look pretty much on par with
the csum_copy version.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 7/7] drivers/net: cfg: fix config file load up
  2021-04-07 10:35   ` Jan Kiszka
@ 2021-05-04 17:18     ` Philippe Gerum
  0 siblings, 0 replies; 27+ messages in thread
From: Philippe Gerum @ 2021-05-04 17:18 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai


Jan Kiszka <jan.kiszka@siemens.com> writes:

> On 27.03.21 11:19, Philippe Gerum wrote:
>> From: Philippe Gerum <rpm@xenomai.org>
>> 
>> set_fs() is on its way out, so we cannot open code a file read
>> operation by calling the VFS handler directly anymore, faking a user
>> address space.
>> 
>> We do have kernel interfaces for loading files though, particularly
>> kernel_read_file(). So let's use that one for loading the
>> configuration file contents. Unfortunately, the signature of this
>> service changed during the 5.9-rc cycle, so we have to resort to an
>> ugly wrapper to cope with all supported kernels once again. Sigh.
>> 
>> Signed-off-by: Philippe Gerum <rpm@xenomai.org>
>> ---
>>  .../include/asm-generic/xenomai/wrappers.h    | 15 ++++
>>  kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c  | 74 +++++++++----------
>>  2 files changed, 52 insertions(+), 37 deletions(-)
>> 
>> diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
>> index cfd28fc47..f15fe4389 100644
>> --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
>> +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h
>> @@ -220,4 +220,19 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name,
>>  	})
>>  #endif
>>  
>> +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0)
>> +#define read_file_from_kernel(__file, __buf, __buf_size, __file_size, __id) \
>> +	({								\
>> +		loff_t ___file_size;					\
>> +		int __ret;						\
>> +		__ret = kernel_read_file(__file, __buf, &___file_size,	\
>> +				__buf_size, __id);			\
>> +		(*__file_size) = ___file_size;				\
>> +		__ret;							\
>> +	})
>> +#else
>> +#define read_file_from_kernel(__file, __buf, __buf_size, __file_size, __id) \
>> +	kernel_read_file(__file, 0, __buf, __buf_size, __file_size, __id)
>> +#endif
>> +
>>  #endif /* _COBALT_ASM_GENERIC_WRAPPERS_H */
>> diff --git a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c
>> index 769b4e143..e460571c2 100644
>> --- a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c
>> +++ b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c
>> @@ -22,6 +22,7 @@
>>   *
>>   */
>>  
>> +#include <linux/fs.h>
>>  #include <linux/file.h>
>>  #include <linux/vmalloc.h>
>>  
>> @@ -196,6 +197,40 @@ void cleanup_cmd_detach(void *priv_data)
>>  		kfree_rtskb(cmd->args.detach.stage2_chain);
>>  }
>>  
>> +static int load_cfg_file(struct rtcfg_file *cfgfile, struct rtcfg_cmd *cmd)
>> +{
>> +	size_t file_size = 0;
>> +	struct file *filp;
>> +	loff_t i_size;
>> +	int ret;
>> +
>> +	filp = filp_open(cfgfile->name, O_RDONLY, 0);
>> +	if (IS_ERR(filp))
>> +		return PTR_ERR(filp);
>> +
>> +	i_size = i_size_read(file_inode(filp));
>> +	if (i_size <= 0) {
>> +		/* allocate buffer even for empty files */
>> +		cfgfile->buffer = vmalloc(1);
>> +	} else {
>> +		cfgfile->buffer = NULL;
>> +		/* Assume 1Mb should be enough for a config file... */
>
> This limitation is new, and I'm not sure if that would be a good idea.

This would be the size of a config file. 1MB would be too tight in some
real world scenario(s)? Anyway, we don't need to impose this limit, I
agree.

> But I need to read up the protocol details again.
>
> Why do we need that limit? We know i_size already, no?
>

Yes we do. So here is a better option I'm going to push upstream
instead:

diff --git a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c
index e460571c2..158d7118f 100644
--- a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c
+++ b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c
@@ -213,10 +213,10 @@ static int load_cfg_file(struct rtcfg_file *cfgfile, struct rtcfg_cmd *cmd)
 		/* allocate buffer even for empty files */
 		cfgfile->buffer = vmalloc(1);
 	} else {
-		cfgfile->buffer = NULL;
-		/* Assume 1Mb should be enough for a config file... */
+		cfgfile->buffer = NULL; /* Leave allocation to the kernel. */
 		ret = read_file_from_kernel(filp, &cfgfile->buffer,
-				1UL << 30, &file_size, READING_UNKNOWN);
+					i_size_read(file_inode(filp)),
+					&file_size, READING_UNKNOWN);
 		if (ret < 0) {
 			fput(filp);
 			return ret;

-- 
Philippe.


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck()
  2021-05-04 14:48                     ` Philippe Gerum
@ 2021-05-05  5:43                       ` Jan Kiszka
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Kiszka @ 2021-05-05  5:43 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

On 04.05.21 16:48, Philippe Gerum wrote:
> 
> Philippe Gerum <rpm@xenomai.org> writes:
> 
>> Philippe Gerum <rpm@xenomai.org> writes:
>>
>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>
>>>> On 16.04.21 18:48, Philippe Gerum wrote:
>>>>>
>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>
>>>>>> On 15.04.21 09:54, Philippe Gerum wrote:
>>>>>>>
>>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>>>
>>>>>>>> On 15.04.21 09:21, Philippe Gerum wrote:
>>>>>>>>>
>>>>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>>>>>
>>>>>>>>>> On 27.03.21 11:19, Philippe Gerum wrote:
>>>>>>>>>>> From: Philippe Gerum <rpm@xenomai.org>
>>>>>>>>>>>
>>>>>>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
>>>>>>>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing
>>>>>>>>>>> a non-zero value would not even yield the proper result on some
>>>>>>>>>>> architectures.
>>>>>>>>>>>
>>>>>>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>>>>>>>>>>> to be used in the next computation, so let's wrap net_csum_copy() to
>>>>>>>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for
>>>>>>>>>>> later kernels so that we still feed csum_partial() with the user-given
>>>>>>>>>>> csum. We still expect the x86, ARM and arm64 implementations of
>>>>>>>>>>> csum_partial() to do the right thing.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If that issue only affects the ICMP code path, why not only changing
>>>>>>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
>>>>>>>>>> and csum in one step?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> As a result of #cc44c17baf7f3, I see no common helper available from the
>>>>>>>>> kernel folding the copy and checksum operations anymore, so I see no way
>>>>>>>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one
>>>>>>>>> replacement for csum_partial_copy_nocheck() which would allow a csum
>>>>>>>>> value to be fed in?
>>>>>>>>>
>>>>>>>>
>>>>>>>> rtskb_copy_and_csum_dev does not need that.
>>>>>>>>
>>>>>>>
>>>>>>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do
>>>>>>> you want to deal with this?
>>>>>>>
>>>>>>
>>>>>> That is an internal API, so we don't care.
>>>>>>
>>>>>> But even if we converted rtskb_copy_and_csum_bits to work as it used to
>>>>>> (with a csum != 0), there would be not reason to make
>>>>>> rtskb_copy_and_csum_dev pay that price.
>>>>>>
>>>>>
>>>>> Well, there may be a reason to challenge the idea that a folded
>>>>> copy_and_csum is better for a real-time system than a split memcpy+csum
>>>>> in the first place. Out of curiosity, I ran a few benchmarks lately on a
>>>>> few SoCs regarding this, and it turned out that optimizing the data copy
>>>>> to get the buffer quickly in place before checksumming the result may
>>>>> well yield much better results with respect to jitter than what
>>>>> csum_and_copy currently achieves on these SoCs.
>>>>>
>>>>> Inline csum_and_copy did perform slightly better on average (a couple of
>>>>> hundreds of nanosecs at best) but with pathological jittery in the worst
>>>>> case at times. On the contrary, the split memcpy+csum method did not
>>>>> exhibit such jittery during these tests, not once.
>>>>>
>>>>> - four SoCs tested (2 x x86, armv7, armv8a)
>>>>> - test code ran in kernel space (real-time task context,
>>>>>   out-of-band/primary context)
>>>>> - csum_partial_copy_nocheck() vs memcpy()+csum_partial()
>>>>> - 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each
>>>>> - all buffers (src & dst) aligned on L1_CACHE_BYTES
>>>>> - each run performed 1000,000 iterations of a given checksum loop, no
>>>>>   pause in between.
>>>>> - no concurrent load on the board
>>>>> - all results in nanoseconds
>>>>>
>>>>> The worst results obtained are presented here for each SoC.
>>>>>
>>>>> x86[1]
>>>>> ------
>>>>>
>>>>> vendor_id	: GenuineIntel
>>>>> cpu family	: 6
>>>>> model		: 92
>>>>> model name	: Intel(R) Atom(TM) Processor E3940 @ 1.60GHz
>>>>> stepping	: 9
>>>>> cpu MHz		: 1593.600
>>>>> cache size	: 1024 KB
>>>>> cpuid level	: 21
>>>>> wp		: yes
>>>>> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts
>>>>> vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=68, max=653, avg=70
>>>>> CSUM_COPY 1024b: min=248, max=373, avg=251
>>>>> CSUM_COPY 1500b: min=344, max=3123, avg=350   <=================
>>>>> COPY+CSUM 32b: min=101, max=790, avg=103
>>>>> COPY+CSUM 1024b: min=297, max=397, avg=300
>>>>> COPY+CSUM 1500b: min=402, max=490, avg=405
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=68, max=1420, avg=70
>>>>> CSUM_COPY 1024b: min=248, max=29706, avg=251   <=================
>>>>> CSUM_COPY 1500b: min=344, max=792, avg=350
>>>>> COPY+CSUM 32b: min=101, max=872, avg=103
>>>>> COPY+CSUM 1024b: min=297, max=776, avg=300
>>>>> COPY+CSUM 1500b: min=402, max=853, avg=405
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=68, max=661, avg=70
>>>>> CSUM_COPY 1024b: min=248, max=1714, avg=251
>>>>> CSUM_COPY 1500b: min=344, max=33937, avg=350   <=================
>>>>> COPY+CSUM 32b: min=101, max=610, avg=103
>>>>> COPY+CSUM 1024b: min=297, max=605, avg=300
>>>>> COPY+CSUM 1500b: min=402, max=712, avg=405
>>>>>
>>>>> x86[2]
>>>>> ------
>>>>>
>>>>> vendor_id       : GenuineIntel
>>>>> cpu family      : 6
>>>>> model           : 23
>>>>> model name      : Intel(R) Core(TM)2 Duo CPU     E7200  @ 2.53GHz
>>>>> stepping        : 6
>>>>> microcode       : 0x60c
>>>>> cpu MHz         : 1627.113
>>>>> cache size      : 3072 KB
>>>>> cpuid level     : 10
>>>>> wp              : yes
>>>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm
>>>>>
>>>>> CSUM_COPY 32b: min=558, max=31010, avg=674     <=================
>>>>> CSUM_COPY 1024b: min=558, max=2794, avg=745
>>>>> CSUM_COPY 1500b: min=558, max=2794, avg=841
>>>>> COPY+CSUM 32b: min=558, max=2794, avg=671
>>>>> COPY+CSUM 1024b: min=558, max=2794, avg=778
>>>>> COPY+CSUM 1500b: min=838, max=2794, avg=865
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=59, max=532, avg=62
>>>>> CSUM_COPY 1024b: min=198, max=270, avg=201
>>>>> CSUM_COPY 1500b: min=288, max=34921, avg=289   <=================
>>>>> COPY+CSUM 32b: min=53, max=326, avg=56
>>>>> COPY+CSUM 1024b: min=228, max=461, avg=232
>>>>> COPY+CSUM 1500b: min=311, max=341, avg=317
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=59, max=382, avg=62
>>>>> CSUM_COPY 1024b: min=198, max=383, avg=201
>>>>> CSUM_COPY 1500b: min=285, max=21235, avg=289   <=================
>>>>> COPY+CSUM 32b: min=52, max=300, avg=56
>>>>> COPY+CSUM 1024b: min=228, max=348, avg=232
>>>>> COPY+CSUM 1500b: min=311, max=409, avg=317
>>>>>
>>>>> Cortex A9 quad-core 1.2Ghz (iMX6qp)
>>>>> -----------------------------------
>>>>>
>>>>> model name	: ARMv7 Processor rev 10 (v7l)
>>>>> Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
>>>>> CPU implementer	: 0x41
>>>>> CPU architecture: 7
>>>>> CPU variant	: 0x2
>>>>> CPU part	: 0xc09
>>>>> CPU revision	: 10
>>>>>
>>>>> CSUM_COPY 32b: min=333, max=1334, avg=440
>>>>> CSUM_COPY 1024b: min=1000, max=2666, avg=1060
>>>>> CSUM_COPY 1500b: min=1333, max=45333, avg=1357   <=================
>>>>> COPY+CSUM 32b: min=333, max=1334, avg=476
>>>>> COPY+CSUM 1024b: min=1000, max=2333, avg=1324
>>>>> COPY+CSUM 1500b: min=1666, max=2334, avg=1713
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=333, max=1334, avg=439
>>>>> CSUM_COPY 1024b: min=1000, max=46000, avg=1060   <=================
>>>>> CSUM_COPY 1500b: min=1333, max=5000, avg=1351
>>>>> COPY+CSUM 32b: min=333, max=1334, avg=476
>>>>> COPY+CSUM 1024b: min=1000, max=2334, avg=1324
>>>>> COPY+CSUM 1500b: min=1666, max=2667, avg=1713
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=333, max=1666, avg=454
>>>>> CSUM_COPY 1024b: min=1000, max=2000, avg=1060
>>>>> CSUM_COPY 1500b: min=1333, max=45000, avg=1348   <=================
>>>>> COPY+CSUM 32b: min=333, max=1334, avg=454
>>>>> COPY+CSUM 1024b: min=1000, max=2334, avg=1317
>>>>> COPY+CSUM 1500b: min=1666, max=6000, avg=1712
>>>>>
>>>>> Cortex A55 quad-core 2Ghz (Odroid C4)
>>>>> -------------------------------------
>>>>>
>>>>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
>>>>> CPU implementer : 0x41
>>>>> CPU architecture: 8
>>>>> CPU variant     : 0x1
>>>>> CPU part        : 0xd05
>>>>> CPU revision    : 0
>>>>>
>>>>>
>>>>> CSUM_COPY 32b: min=125, max=833, avg=140
>>>>> CSUM_COPY 1024b: min=625, max=41916, avg=673   <=================
>>>>> CSUM_COPY 1500b: min=875, max=3875, avg=923
>>>>> COPY+CSUM 32b: min=125, max=458, avg=140
>>>>> COPY+CSUM 1024b: min=625, max=1166, avg=666
>>>>> COPY+CSUM 1500b: min=875, max=1167, avg=913
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=125, max=1292, avg=139
>>>>> CSUM_COPY 1024b: min=541, max=48333, avg=555
>>>>> CSUM_COPY 1500b: min=708, max=3458, avg=740
>>>>> COPY+CSUM 32b: min=125, max=292, avg=136
>>>>> COPY+CSUM 1024b: min=541, max=750, avg=556
>>>>> COPY+CSUM 1500b: min=708, max=834, avg=740
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=125, max=833, avg=140
>>>>> CSUM_COPY 1024b: min=666, max=55667, avg=673   <=================
>>>>> CSUM_COPY 1500b: min=875, max=4208, avg=913
>>>>> COPY+CSUM 32b: min=125, max=375, avg=140
>>>>> COPY+CSUM 1024b: min=666, max=916, avg=673
>>>>> COPY+CSUM 1500b: min=875, max=1042, avg=913
>>>>>
>>>>> ============
>>>>>
>>>>> A few additional observations from looking at the implementation:
>>>>>
>>>>> For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete
>>>>> buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb
>>>>> which is faster.
>>>>>
>>>>> arm32/armv7 optimizes memcpy by loading up to 8 words in a single
>>>>> instruction. csum_and_copy loads/stores at best 4 words at a time,
>>>>> only when src and dst are 32bit aligned (which matches the test case).
>>>>>
>>>>> arm64/armv8a uses load/store pair instructions to copy memory
>>>>> blocks. It does not have asm-optimized csum_and_copy support, so it
>>>>> uses the generic C version.
>>>>>
>>>>> What could be inferred in terms of prefetching and speculation might
>>>>> explain some differences between the approaches too.
>>>>>
>>>>> I would be interested in any converging / diverging results testing the
>>>>> same combo with a different test code, because from my standpoint,
>>>>> things do not seem as obvious as they are supposed to be at the moment.
>>>>>
>>>>
>>>> If copy+csum is not using any recent memcopy optimizations, that is an
>>>> argument for at least equivalent performance.
>>>>
>>>
>>> You mean the folded version, i.e. copy_and_csum? If so, I can't see any
>>> way for that one to optimize via fast string operations.
>>>
>>>> But I don't get yet where the huge jittery should be coming from. Were
>>>> the measurement loop preemptible? In that case I would expect a split
>>>
>>> Out of band stage, so only preemptible by Xenomai timer ticks, which
>>> means only the host tick emulation at this point since there was no
>>> outstanding Xenomai timers started yet when running the loops. Pretty
>>> slim chance to see these latency spots consistently reproduced, and only
>>> for the folded copy_sum version.
>>>
>>>> copy followed by another loop to csum should give much worse results as
>>>> it needs the cache to stay warm - while copy-csum only touches the data
>>>> once.
>>>>
>>>
>>> Conversely, if the copy is much faster, the odds of being preempted may
>>> increase, yielding better results overall.
>>
>> False alarm. Preemption was the issue, by the top half of the host tick
>> handling in primary mode. The latest clock event scheduled by the kernel
>> managed to enter the pipeline at a random time, but always within the
>> execution window of the all-in-one csum_and_copy code. Although this
>> event was deferred and not immediately passed to the in-band context,
>> the time spent dealing with it was enough to show up in the results.
>>
>>> This said, I'm unsure this is
>>> related to preemption anyway, this looks like the fingerprints of minor
>>> faults with PTEs. Why this would only happen in the folded version is
>>> still a mystery to me at the moment.
>>
>> It did not actually, no minor faults.
>>
>> The results are now consistent, both implementations are comparable
>> performance-wise as the optimized memcpy tends to offset the advantage
>> of calculating the checksum on the fly, saving a read access. armv8
>> benefits more from the former, since it does not have an optimized
>> csum_and_copy but uses the generic C version instead.
>>
>> == x86[1]
>>
>> CSUM_COPY 32b: min=68, max=640, avg=70
>> CSUM_COPY 1024b: min=247, max=773, avg=252
>> CSUM_COPY 1500b: min=343, max=832, avg=350
>> COPY+CSUM 32b: min=100, max=651, avg=131
>> COPY+CSUM 1024b: min=296, max=752, avg=298
>> COPY+CSUM 1500b: min=397, max=845, avg=400
>>
>> == x86[2]
>>
>> CSUM_COPY 32b: min=63, max=267, avg=66
>> CSUM_COPY 1024b: min=198, max=300, avg=201
>> CSUM_COPY 1500b: min=288, max=611, avg=291
>> COPY+CSUM 32b: min=56, max=360, avg=56
>> COPY+CSUM 1024b: min=228, max=420, avg=231
>> COPY+CSUM 1500b: min=307, max=337, avg=318
>>
>> == armv7 (imx6qp)
>>
>> CSUM_COPY 32b: min=333, max=1334, avg=439
>> CSUM_COPY 1024b: min=1000, max=2000, avg=1045
>> CSUM_COPY 1500b: min=1000, max=2334, avg=1325
>> COPY+CSUM 32b: min=333, max=1334, avg=454
>> COPY+CSUM 1024b: min=1333, max=2334, avg=1347
>> COPY+CSUM 1500b: min=1666, max=2667, avg=1734
>>
>> == armv8a (C4)
>>
>> CSUM_COPY 32b: min=125, max=792, avg=130
>> CSUM_COPY 1024b: min=500, max=1125, avg=550
>> CSUM_COPY 1500b: min=708, max=1833, avg=726
>> COPY+CSUM 32b: min=125, max=292, avg=130
>> COPY+CSUM 1024b: min=541, max=708, avg=550
>> COPY+CSUM 1500b: min=708, max=875, avg=730
> 
> Last round of results about this issue, now measuring the csum_copy vs
> csum+copy performances in idle vs busy scenarios. Busy means
> hackbench+dd loop streaming 128M in the background from zero -> null, in
> order to badly trash the D-caches while the test runs. All figures in
> nanosecs.
> 
> iMX6QP (Cortex A9)
> ------------------
> 
> === idle
> 
> CSUM_COPY 32b: min=333, max=1333, avg=439
> CSUM_COPY 1024b: min=1000, max=2000, avg=1045
> CSUM_COPY 1500b: min=1333, max=2000, avg=1333
> COPY+CSUM 32b: min=333, max=1333, avg=443
> COPY+CSUM 1024b: min=1000, max=2334, avg=1345
> COPY+CSUM 1500b: min=1666, max=2667, avg=1737
> 
> === busy
> 
> CSUM_COPY 32b: min=333, max=4333, avg=466
> CSUM_COPY 1024b: min=1000, max=5000, avg=1088
> CSUM_COPY 1500b: min=1333, max=5667, avg=1393
> COPY+CSUM 32b: min=333, max=1334, avg=454
> COPY+CSUM 1024b: min=1000, max=2000, avg=1341
> COPY+CSUM 1500b: min=1666, max=2666, avg=1745
> 
> C4 (Cortex A55)
> ---------------
> 
> === idle
> 
> CSUM_COPY 32b: min=125, max=791, avg=130
> CSUM_COPY 1024b: min=541, max=834, avg=550
> CSUM_COPY 1500b: min=708, max=1875, avg=740
> COPY+CSUM 32b: min=125, max=167, avg=133
> COPY+CSUM 1024b: min=541, max=625, avg=553
> COPY+CSUM 1500b: min=708, max=750, avg=730
> 
> === busy
> 
> CSUM_COPY 32b: min=125, max=792, avg=133
> CSUM_COPY 1024b: min=500, max=2000, avg=552
> CSUM_COPY 1500b: min=708, max=1542, avg=744
> COPY+CSUM 32b: min=125, max=375, avg=133
> COPY+CSUM 1024b: min=500, max=709, avg=553
> COPY+CSUM 1500b: min=708, max=916, avg=743
> 
> x86 (atom x5)
> -------------
> 
> === idle
> 
> CSUM_COPY 32b: min=67, max=590, avg=70
> CSUM_COPY 1024b: min=245, max=385, avg=251
> CSUM_COPY 1500b: min=343, max=521, avg=350
> COPY+CSUM 32b: min=101, max=679, avg=117
> COPY+CSUM 1024b: min=296, max=379, avg=298
> COPY+CSUM 1500b: min=399, max=502, avg=404
> 
> == busy
> 
> CSUM_COPY 32b: min=65, max=709, avg=71
> CSUM_COPY 1024b: min=243, max=702, avg=252
> CSUM_COPY 1500b: min=340, max=1055, avg=351
> COPY+CSUM 32b: min=100, max=665, avg=120
> COPY+CSUM 1024b: min=295, max=669, avg=298
> COPY+CSUM 1500b: min=399, max=686, avg=403
> 
> As expected from the code, arm64 which has no folded csum_copy
> implementation makes the best of using the split copy+csum path. All
> architectures seem to benefit from optimized memcpy under load when it
> comes to the worst case execution time. x86 is less prone to jittery
> under cache trashing than others as usual, but even there, the
> max. figures for csum+copy in busy context look pretty much on par with
> the csum_copy version.
> 

Then let's go for your conversion - but then possibly even
unconditionally, no?

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2021-05-05  5:43 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum
2021-03-27 10:19 ` [PATCH v2 1/7] cobalt/memory: fix __vmalloc() calls Philippe Gerum
2021-03-27 10:19 ` [PATCH v2 2/7] cobalt/debug: switch to mmap_lock interface Philippe Gerum
2021-03-27 10:19 ` [PATCH v2 3/7] cobalt/kernel: convert to proc_ops Philippe Gerum
2021-04-07  9:52   ` Jan Kiszka
2021-04-07 10:17     ` Philippe Gerum
2021-04-07 10:37       ` Jan Kiszka
2021-04-07 11:03         ` Philippe Gerum
2021-03-27 10:19 ` [PATCH v2 4/7] cobalt/debug: prefer dump_stack() to show_stack() Philippe Gerum
2021-03-27 10:19 ` [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() Philippe Gerum
2021-04-07 10:06   ` Jan Kiszka
2021-04-15  7:21     ` Philippe Gerum
2021-04-15  7:35       ` Jan Kiszka
2021-04-15  7:54         ` Philippe Gerum
2021-04-15  8:10           ` Jan Kiszka
2021-04-16 16:48             ` Philippe Gerum
2021-04-16 17:12               ` Jan Kiszka
2021-04-18  9:18                 ` Philippe Gerum
2021-04-18 15:50                   ` Philippe Gerum
2021-05-04 14:48                     ` Philippe Gerum
2021-05-05  5:43                       ` Jan Kiszka
2021-03-27 10:19 ` [PATCH v2 6/7] drivers/net: icmp: remove variable-length array Philippe Gerum
2021-04-07 10:24   ` Jan Kiszka
2021-03-27 10:19 ` [PATCH v2 7/7] drivers/net: cfg: fix config file load up Philippe Gerum
2021-04-07 10:35   ` Jan Kiszka
2021-05-04 17:18     ` Philippe Gerum
2021-04-07 10:39 ` [PATCH v2 0/7] assorted v5.x related fixups Jan Kiszka

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.