linux-kernel.vger.kernel.org archive mirror
* Re: [RFC] Add a new generic system call which has better performance, to get /proc data, than existing mechanisms
@ 2022-11-18 19:12 Anjali Kulkarni
  2022-11-19  8:49 ` Greg KH
  0 siblings, 1 reply; 8+ messages in thread
From: Anjali Kulkarni @ 2022-11-18 19:12 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: anjali.k.kulkarni

Hi Greg,
I was looking at your readfile() system call and it seems like something useful to us - is it expected to go into mainline any time soon?
Thanks
Anjali

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Add a new generic system call which has better performance, to get /proc data, than existing mechanisms
  2022-11-18 19:12 [RFC] Add a new generic system call which has better performance, to get /proc data, than existing mechanisms Anjali Kulkarni
@ 2022-11-19  8:49 ` Greg KH
       [not found]   ` <MN2PR10MB414411D0E29F20412E6DF0C0C4089@MN2PR10MB4144.namprd10.prod.outlook.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Greg KH @ 2022-11-19  8:49 UTC (permalink / raw)
  To: Anjali Kulkarni; +Cc: linux-kernel

On Fri, Nov 18, 2022 at 11:12:02AM -0800, Anjali Kulkarni wrote:
> Hi Greg,
> I was looking at your readfile() system call and it seems like something
> useful to us - is it expected to go into mainline any time soon?

Can you test it to see if it actually helps your workload?  All of the
ones I played with were just very minor improvements or lost in the
noise.

Also, look into using io_uring as that can probably do the same thing,
right?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Add a new generic system call which has better performance, to get /proc data, than existing mechanisms
       [not found]   ` <MN2PR10MB414411D0E29F20412E6DF0C0C4089@MN2PR10MB4144.namprd10.prod.outlook.com>
@ 2022-11-20  8:21     ` Greg KH
  2022-11-20 19:37       ` Anjali Kulkarni
  2022-11-21 18:28       ` Anjali Kulkarni
  0 siblings, 2 replies; 8+ messages in thread
From: Greg KH @ 2022-11-20  8:21 UTC (permalink / raw)
  To: Anjali Kulkarni; +Cc: linux-kernel

Note, html email is rejected by kernel mailing lists, and top-posting
does not work at all in discussions.  Please fix your email client if
you wish to participate in kernel development.

On Sat, Nov 19, 2022 at 05:50:03PM +0000, Anjali Kulkarni wrote:
> I will give it a try, but the majority of the savings are due to avoiding the conversion from binary to string in /proc.

That goes contrary to your previous statement saying that the readfile
call would help out here.

And there might be ways to convert binary to strings, perhaps look into
doing that?

good luck,

greg k-h

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Add a new generic system call which has better performance, to get /proc data, than existing mechanisms
  2022-11-20  8:21     ` Greg KH
@ 2022-11-20 19:37       ` Anjali Kulkarni
  2022-11-21 18:28       ` Anjali Kulkarni
  1 sibling, 0 replies; 8+ messages in thread
From: Anjali Kulkarni @ 2022-11-20 19:37 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel

Thank you! I have fixed it.
Anjali

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Add a new generic system call which has better performance, to get /proc data, than existing mechanisms
  2022-11-20  8:21     ` Greg KH
  2022-11-20 19:37       ` Anjali Kulkarni
@ 2022-11-21 18:28       ` Anjali Kulkarni
  1 sibling, 0 replies; 8+ messages in thread
From: Anjali Kulkarni @ 2022-11-21 18:28 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel


> And there might be ways to convert binary to strings, perhaps look into
> doing that?

I am not sure what you mean by that. Could I create a new /proc node which is not strings, but rather dumps a struct? That is what would help us the most, though I am not sure it would be an acceptable solution upstream.
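
For example, something along these lines is roughly what I have in mind; this is a minimal sketch only, and the module, node name, and struct layout are purely illustrative:

#include <linux/fs.h>
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/sched.h>

struct pid_times_bin {
	__u64 start_boottime_ns;	/* binary value, no ASCII conversion */
};

static ssize_t bin_stat_read(struct file *file, char __user *buf,
			     size_t count, loff_t *ppos)
{
	struct pid_times_bin pt = {
		.start_boottime_ns = current->start_boottime,
	};

	/* Dump the struct as-is; the reader interprets the layout. */
	return simple_read_from_buffer(buf, count, ppos, &pt, sizeof(pt));
}

static const struct proc_ops bin_stat_ops = {
	.proc_read	= bin_stat_read,
	.proc_lseek	= default_llseek,
};

static int __init bin_stat_init(void)
{
	proc_create("bin_stat_demo", 0444, NULL, &bin_stat_ops);
	return 0;
}

static void __exit bin_stat_exit(void)
{
	remove_proc_entry("bin_stat_demo", NULL);
}

module_init(bin_stat_init);
module_exit(bin_stat_exit);
MODULE_LICENSE("GPL");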

Thanks
Anjali

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Add a new generic system call which has better performance, to get /proc data, than existing mechanisms
@ 2022-11-18 19:10 Anjali Kulkarni
  0 siblings, 0 replies; 8+ messages in thread
From: Anjali Kulkarni @ 2022-11-18 19:10 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: anjali.k.kulkarni



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Add a new generic system call which has better performance, to get /proc data, than existing mechanisms
  2022-11-16 18:37 Anjali Kulkarni
@ 2022-11-17 21:38 ` Greg KH
  0 siblings, 0 replies; 8+ messages in thread
From: Greg KH @ 2022-11-17 21:38 UTC (permalink / raw)
  To: Anjali Kulkarni; +Cc: linux-kernel, linux-api, linux-fsdevel

On Wed, Nov 16, 2022 at 10:37:24AM -0800, Anjali Kulkarni wrote:
> Currently, reading from /proc is an expensive operation, for a
> performance sensitive application

Then perhaps "performance sensitive applications" should not be reading
10's of thousands of /proc files?

Anyway, your proposal comes up every few years, in different ways.
Please research the past proposals for why this keeps failing and
perhaps you should just fix up your userspace code instead?

Also, look at attempts like the introduction of the readfile syscall as
well, if you want to reduce the open/read/close sequence of syscalls to
one, but even that isn't all that useful for real-world applications, as
you can today use the io_uring API to achieve almost the same throughput
if really needed.
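
For instance, a minimal sketch (illustrative only, error checking
omitted) of batching reads of /proc files with liburing looks like this:

#include <fcntl.h>
#include <stdio.h>
#include <liburing.h>

#define NFILES 3

int main(void)
{
	const char *paths[NFILES] = {
		"/proc/self/stat", "/proc/self/status", "/proc/self/io",
	};
	char bufs[NFILES][4096] = { 0 };
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	int fds[NFILES], i;

	io_uring_queue_init(NFILES, &ring, 0);

	for (i = 0; i < NFILES; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		fds[i] = open(paths[i], O_RDONLY);
		io_uring_prep_read(sqe, fds[i], bufs[i], sizeof(bufs[i]) - 1, 0);
		sqe->user_data = i;
	}

	io_uring_submit(&ring);		/* one syscall submits all reads */

	for (i = 0; i < NFILES; i++) {
		io_uring_wait_cqe(&ring, &cqe);
		if (cqe->res > 0)
			bufs[cqe->user_data][cqe->res] = '\0';
		io_uring_cqe_seen(&ring, cqe);
	}

	printf("%s", bufs[0]);
	io_uring_queue_exit(&ring);
	return 0;
}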

good luck!

greg k-h

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC] Add a new generic system call which has better performance, to get /proc data, than existing mechanisms
@ 2022-11-16 18:37 Anjali Kulkarni
  2022-11-17 21:38 ` Greg KH
  0 siblings, 1 reply; 8+ messages in thread
From: Anjali Kulkarni @ 2022-11-16 18:37 UTC (permalink / raw)
  To: linux-kernel, linux-api, linux-fsdevel; +Cc: anjali.k.kulkarni

Currently, reading from /proc is an expensive operation for a performance-sensitive application: the /proc files are in ASCII, so the kernel converts its binary data to ASCII and userspace converts it back again. On top of that, if the application needs to read a huge number of files from /proc in a short amount of time, the time taken is very high and the application's performance suffers. We want to address this by creating a generic system call which can be used to read any of the data exposed in /proc, fetching it from the kernel directly instead of going through /proc. Additionally, a vectored list of /proc items can be fetched in one system call, saving multiple system calls.
 
As an example of such an application at Oracle, the Oracle Database has a multi-process/multi-threaded architecture and maintains process state in shared memory. One of the states it maintains, and needs to refresh periodically, is the start time of many processes (obtained by reading the start-time entry in /proc/pid/stat). However, reading the start time from /proc for thousands of PIDs is slow and consumes a lot of CPU time. We need a faster way to get this data without going through /proc. The database needs this information every 1-3 seconds, with an average of 10K-64K processes per system, so over time the performance savings from a vectored system call are very large.
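
For reference, this is roughly what the per-pid userspace path looks like today: open /proc/<pid>/stat, read it, and parse the ASCII start-time field (field 22, in clock ticks since boot). A minimal sketch, with error handling trimmed:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static unsigned long long read_start_time(int pid)
{
	char path[64], buf[1024], *p;
	unsigned long long start_time = 0;
	int field;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/stat", pid);
	f = fopen(path, "r");
	if (!f || !fgets(buf, sizeof(buf), f)) {
		if (f)
			fclose(f);
		return 0;
	}
	fclose(f);

	/* comm may contain spaces, so skip past the closing ')' first */
	p = strrchr(buf, ')');
	if (!p || !p[1])
		return 0;
	p += 2;				/* now at field 3 (state) */
	for (field = 3; field < 22 && p; field++) {	/* walk to field 22 */
		p = strchr(p, ' ');
		if (p)
			p++;
	}
	if (p)
		start_time = strtoull(p, NULL, 10);	/* ticks since boot */
	return start_time;
}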
 
Another example of an application which can benefit from this is Performance Co-Pilot (PCP), by optimizing its process-related metrics code path. PCP does an opendir and readdir on /proc each time it wants to refresh the pid list. Then, for each pid, it reads various files under /proc/pid to get metrics for the process, e.g. the following proc files are read:

/proc/pid/status
/proc/pid/stat
/proc/pid/wchan
/proc/pid/smaps
/proc/pid/maps
/proc/pid/io

So the more processes are running on the system, the more costly this operation becomes, because that many files have to be opened and read. If this information can be obtained using one (or a few) system calls from the kernel, it will improve performance.

Considering the above two applications, we will initially support the fields under /proc/pid/stat. Eventually, we will expand to the /proc/pid files listed above (needed for PCP). If other applications which need more fields come to our attention, we can add those as the cases arise.

Considering the granularity of the item to be requested from the kernel, we can either fetch an entire /proc/pid/<item>, e.g. /proc/pid/stat, from the kernel as a vectored list for multiple PIDs, or fetch an individual field from /proc/pid/stat, such as the start time, which is what the Oracle database needs. For this we are still working out an efficient mapping mechanism from /proc fields to numeric constants.

To be able to return values of different formats, the system call returns data through a void pointer, which can point to an array of any type: e.g. an array of integers for start times, or an array of structs for an item which has to return multiple values per entry. Both the kernel and userspace interpret the buffer according to the mapping between the requested /proc item and the type returned for it.
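
As an illustration only (this is not part of the prototype below, and the names, values, and struct layout are hypothetical), the mapping could eventually be expressed in a uapi header along these lines:

/* include/uapi/linux/get_vector.h -- hypothetical sketch, not in the prototype */
#include <linux/types.h>

/* Per-field ops: the output buffer holds one scalar per requested pid. */
#define PID_START_TIME	1	/* __u64 per pid: start time in clock ticks since boot */

/* Whole-file ops: the output buffer holds one fixed-size struct per pid. */
#define PID_STAT	2	/* struct pid_stat_bin per pid */

struct pid_stat_bin {
	__s32 pid;
	__s32 ppid;
	__u64 utime;
	__u64 stime;
	__u64 start_time;
	/* ... remaining /proc/<pid>/stat fields ... */
};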

To get some performance numbers, we prototyped a system call to handle the database case, which is of immediate need to Oracle. The system call takes as input an array of PIDs and returns an array of start times. As a micro-benchmark, we compared retrieving the start times of 8000 PIDs by reading /proc directly with fetching them through the system call. Reading /proc directly takes about 100ms for 8000 PIDs, while the system call approach reduces this to about 1ms. The more PIDs, the greater the savings.
 
This system call not only fetches data directly from the kernel (rather than through /proc), but also allows a vectored list of the requested data items to be fetched via one system call. Hence an array of the requested /proc items is input to the system call, and an array of the corresponding values is output from it.

An example of how this call would be made is shown below:
 
ret = syscall(SYSCALL_NUMBER, PID_START_TIME, LEN, pidarr, stimes);
 
where,
PID_START_TIME = the op requesting the start time of a process, as found in /proc/$pid/stat
LEN = length of the array of pids whose start times need to be fetched
pidarr = an input array of length LEN containing the pids whose start times need to be fetched
stimes = an output array of length LEN, filled by the system call with the start times of those pids
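
A fuller usage sketch is shown below, for illustration only; the syscall number and the PID_START_TIME value are simply the ones used by this prototype and are not a stable ABI:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_get_vector	474	/* prototype-only syscall number */
#define PID_START_TIME	1	/* op: fetch the start time of each pid */

int main(int argc, char **argv)
{
	size_t len = argc - 1;
	int *pidarr = calloc(len, sizeof(*pidarr));
	unsigned long long *stimes = calloc(len, sizeof(*stimes));
	size_t i;

	for (i = 0; i < len; i++)
		pidarr[i] = atoi(argv[i + 1]);

	/* One syscall fetches the start times of all requested pids. */
	if (syscall(__NR_get_vector, PID_START_TIME, len, pidarr, stimes) < 0) {
		perror("get_vector");
		return 1;
	}

	for (i = 0; i < len; i++)
		printf("pid %d: start time %llu (clock ticks since boot)\n",
		       pidarr[i], stimes[i]);

	free(pidarr);
	free(stimes);
	return 0;
}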

Diff of prototype is shown below:

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 10c74a2a45bb..ca69f6d38bfc 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -373,6 +373,7 @@
 # This one is a temporary number, designed for no clashes.
 # Nothing but DTrace should use it.
 473	common	waitfd			sys_waitfd
+474	common  get_vector		sys_get_vector
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 49be8c8ef555..91f4d73314c9 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -92,6 +92,7 @@
 #include <linux/string_helpers.h>
 #include <linux/user_namespace.h>
 #include <linux/fs_struct.h>
+#include <linux/syscalls.h>
 
 #include <asm/processor.h>
 #include "internal.h"
@@ -449,6 +450,89 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 	return 0;
 }
 
+/*
+ * Get the start times of the array of PIDs given in pidarr
+ * If any PID is not found or there is an error for any one of the PIDs,
+ * indicate an error by returning start time for that PID as 0, and continue
+ * to the next PID
+ */
+void pids_start_time(int *pidarr, size_t len, unsigned long long *stimes)
+{
+	struct task_struct *task = NULL;
+	int i;
+	struct pid *pid = NULL;
+
+	for (i = 0; i < len; i++) {
+		pid = find_get_pid(pidarr[i]);
+		if (!pid) {
+			stimes[i] = 0;
+			continue;
+		}
+
+		task = get_pid_task(pid, PIDTYPE_PID);
+		put_pid(pid);
+		if (!task) {
+			stimes[i] = 0;
+			continue;
+		}
+
+		if (task->pid == pidarr[i])
+			stimes[i] =
+				nsec_to_clock_t(timens_add_boottime_ns(task->start_boottime));
+		else
+			stimes[i] = 0;
+
+		put_task_struct(task);
+	}
+}
+
+asmlinkage long sys_get_vector(int op, size_t len, const void __user *arg,
+			       void *out_arr)
+{
+	size_t in_tsize, out_tsize;
+	int *in_karr;
+	unsigned long long *out_karr;
+	long ret = 0;
+
+	switch (op) {
+	case PID_START_TIME:
+		in_tsize = len * sizeof(int);
+		in_karr = kmalloc(in_tsize, GFP_KERNEL);
+		if (!in_karr)
+			return -ENOMEM;
+
+		if (copy_from_user(in_karr, arg, in_tsize)) {
+			ret = -EFAULT;
+			goto free_in_karr;
+		}
+
+		out_tsize = len * sizeof(unsigned long long);
+		out_karr = kmalloc(out_tsize, GFP_KERNEL);
+		if (!out_karr) {
+			ret = -ENOMEM;
+			goto free_in_karr;
+		}
+
+		pids_start_time(in_karr, len, out_karr);
+		if (copy_to_user(out_arr, out_karr, out_tsize))
+			ret = -EFAULT;
+
+		kfree(out_karr);
+free_in_karr:
+		kfree(in_karr);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return ret;
+}
+
+SYSCALL_DEFINE4(get_vector, int, op, size_t, len, const void __user *, arg,
+		void *, out_arr)
+{
+	return sys_get_vector(op, len, arg, out_arr);
+}
+
 static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task, int whole)
 {
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 03415f3fb3a8..1220c95b8f9e 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -74,6 +74,8 @@ struct proc_dir_entry {
 	0)
 #define SIZEOF_PDE_INLINE_NAME (SIZEOF_PDE - sizeof(struct proc_dir_entry))
 
+#define PID_START_TIME 1
+
 static inline bool pde_is_permanent(const struct proc_dir_entry *pde)
 {
 	return pde->flags & PROC_ENTRY_PERMANENT;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e92af9a8bbf8..abc0b74d2143 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1384,4 +1384,6 @@ int __sys_getsockopt(int fd, int level, int optname, char __user *optval,
 		int __user *optlen);
 int __sys_setsockopt(int fd, int level, int optname, char __user *optval,
 		int optlen);
+asmlinkage long sys_get_vector(int op, size_t len, const void __user *arg,
+			       void *outp);
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index db285633c05b..8964d6326d9c 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -883,8 +883,11 @@ __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 #define __NR_waitfd 473
 __SYSCALL(__NR_waitfd, sys_waitfd)
 
+#define __NR_get_vector 474
+__SYSCALL(__NR_get_vector, sys_get_vector)
+
 #undef __NR_syscalls
-#define __NR_syscalls 474
+#define __NR_syscalls 475
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 64eb1931bf2a..80d8ac0144be 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -478,3 +478,4 @@ COND_SYSCALL(setuid16);
 
 /* restartable sequence */
 COND_SYSCALL(rseq);
+COND_SYSCALL(get_vector);

^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-11-21 18:28 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-18 19:12 [RFC] Add a new generic system call which has better performance, to get /proc data, than existing mechanisms Anjali Kulkarni
2022-11-19  8:49 ` Greg KH
     [not found]   ` <MN2PR10MB414411D0E29F20412E6DF0C0C4089@MN2PR10MB4144.namprd10.prod.outlook.com>
2022-11-20  8:21     ` Greg KH
2022-11-20 19:37       ` Anjali Kulkarni
2022-11-21 18:28       ` Anjali Kulkarni
  -- strict thread matches above, loose matches on Subject: below --
2022-11-18 19:10 Anjali Kulkarni
2022-11-16 18:37 Anjali Kulkarni
2022-11-17 21:38 ` Greg KH
