Kernel Newbies Archive on lore.kernel.org
 help / color / Atom feed
* pid start time and new display field in proc pid stat
@ 2021-03-25 12:53 Navin P
  2021-03-25 15:20 ` jim.cromie
  0 siblings, 1 reply; 6+ messages in thread
From: Navin P @ 2021-03-25 12:53 UTC (permalink / raw)
  To: kernelnewbies

Hi,

 As of 5.11 kernel (pid,pid_start_time) is not unique /monotonic even
though the underlying counters are .
 I chose start_boottime because i wanted the counter to increase
during suspend as well.

1. Is there any case where task->start_boottime or
ktime_get_boottime_ns doesn't become monotonic i.e increasing ?

2.  If start_boottime is not monotonic which counter to use ?

3.  If i create a new field in task_struct , then i can use a
atomic_add_return(1,&v) to fill in the task->new_field. Will this also
work ?

By doing this <pid,pid_start_time> becomes unique.

 In linux/fs/proc/array.c at line 566 we have

  /* apply timens offset for boottime and convert nsec -> ticks */
start_time =
nsec_to_clock_t(timens_add_boottime_ns(task->start_boottime));
task->start_boottime is a monotonic increasing counter fetched from
ktime_get_boottime_ns in fork.c

nsec_to_clock_t contains div_u64 due to which we loose some lower
bits/digits  on divison
and is not unique unless the divisor is 1.

 if CONFIG_HZ = 250 and nsec_to_clock_t x=4000001 , then

#if (NSEC_PER_SEC % USER_HZ) == 0
return div_u64(x, NSEC_PER_SEC / USER_HZ);

becomes div_u64(x, 4000000) then  return value is 1
when x=4000002, return value is 1
until x=8000000 which returns 2.

The value shown in /proc/[pid]/stat is actually the truncated value.

Hence i'm planning to display a counter at the end of /proc/[pid]/stat
 as the 53rd field.

I've prepared a patch as inlined.

From a2c6b5d6435394f015d38700008ff74f16dfa5fd Mon Sep 17 00:00:00 2001
From: Navin P <navinp0304@gmail.com>
Date: Thu, 25 Mar 2021 15:27:30 +0530
Subject: [PATCH] Display task->start_boottime as 53rd field in
 /proc/[pid]/stat.  The 22nd field start_time currently shown in
 /proc/[pid]/stat as start_time is truncated by division.Hence it is not
 unique .

Signed-off-by: Navin P <navinp0304@gmail.com>
---
 Documentation/filesystems/proc.rst | 1 +
 fs/proc/array.c                    | 1 +
 2 files changed, 2 insertions(+)

diff --git a/Documentation/filesystems/proc.rst
b/Documentation/filesystems/proc.rst
index 48fbfc336ebf..3b7a1543b2c0 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -381,6 +381,7 @@ It's slow but very precise.
   env_end       address below which program environment is placed
   exit_code     the thread's exit_code in the form reported by the waitpid
  system call
+  start_boottime the process start time in nanoseconds since boot
   ============= ===============================================================

 The /proc/PID/maps file contains the currently mapped memory regions and
diff --git a/fs/proc/array.c b/fs/proc/array.c
index bb87e4d89cd8..74389aaefa9c 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -645,6 +645,7 @@ static int do_task_stat(struct seq_file *m, struct
pid_namespace *ns,
  else
  seq_puts(m, " 0");

+ seq_put_decimal_ull(m, " ", task->start_boottime);
  seq_putc(m, '\n');
  if (mm)
  mmput(mm);
-- 
2.25.1

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: pid start time and new display field in proc pid stat
  2021-03-25 12:53 pid start time and new display field in proc pid stat Navin P
@ 2021-03-25 15:20 ` jim.cromie
  2021-03-26  3:01   ` Navin P
  0 siblings, 1 reply; 6+ messages in thread
From: jim.cromie @ 2021-03-25 15:20 UTC (permalink / raw)
  To: Navin P; +Cc: kernelnewbies

[-- Attachment #1.1: Type: text/plain, Size: 968 bytes --]

On Thu, Mar 25, 2021 at 6:53 AM Navin P <navinp0304@gmail.com> wrote:

> Hi,
>
>  As of 5.11 kernel (pid,pid_start_time) is not unique /monotonic even
> though the underlying counters are .
>  I chose start_boottime because i wanted the counter to increase
> during suspend as well.
>
> 1. Is there any case where task->start_boottime or
> ktime_get_boottime_ns doesn't become monotonic i.e increasing ?
>
> 2.  If start_boottime is not monotonic which counter to use ?
>
> 3.  If i create a new field in task_struct , then i can use a
> atomic_add_return(1,&v) to fill in the task->new_field. Will this also
> work ?
>
>
Its my understanding that task-struct is a highly "contended" resource.

its a basic element, its size matters
everybody wants a bit/byte for something special,
conflicts ensue (or could).
nobody gets them without a real good reason.

I dont know what youre trying to achieve, but i suspect that
youll need to find a more subtle way of doing it.

[-- Attachment #1.2: Type: text/html, Size: 1471 bytes --]

[-- Attachment #2: Type: text/plain, Size: 170 bytes --]

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: pid start time and new display field in proc pid stat
  2021-03-25 15:20 ` jim.cromie
@ 2021-03-26  3:01   ` Navin P
  2021-04-01 14:50     ` pid start time and new display field in proc pid stat [ race in task->start_boottime ,start_time] Navin P
  0 siblings, 1 reply; 6+ messages in thread
From: Navin P @ 2021-03-26  3:01 UTC (permalink / raw)
  To: jim.cromie; +Cc: kernelnewbies

On Thu, Mar 25, 2021 at 8:51 PM <jim.cromie@gmail.com> wrote:
>
>
>
> On Thu, Mar 25, 2021 at 6:53 AM Navin P <navinp0304@gmail.com> wrote:
>>
>> Hi,
>>
>>  As of 5.11 kernel (pid,pid_start_time) is not unique /monotonic even
>> though the underlying counters are .
>>  I chose start_boottime because i wanted the counter to increase
>> during suspend as well.
>>
>> 1. Is there any case where task->start_boottime or
>> ktime_get_boottime_ns doesn't become monotonic i.e increasing ?
>>
>> 2.  If start_boottime is not monotonic which counter to use ?
>>
>> 3.  If i create a new field in task_struct , then i can use a
>> atomic_add_return(1,&v) to fill in the task->new_field. Will this also
>> work ?
>>
>
> Its my understanding that task-struct is a highly "contended" resource.
>
> its a basic element, its size matters
> everybody wants a bit/byte for something special,
> conflicts ensue (or could).
> nobody gets them without a real good reason.
>
> I dont know what youre trying to achieve, but i suspect that
> youll need to find a more subtle way of doing it.


Yes creating a new field is not appropriate. I was checking on the
other options in case a similar problem occurs, like other ways of
solving this problem.

My patch doesn't create a new field, it uses the existing field and
prints it in /proc/[pid]/stat. Stat files are created
on first lookup. These are used by ps or top.

+ seq_put_decimal_ull(m, " ", task->start_boottime);

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: pid start time and new display field in proc pid stat [ race in task->start_boottime ,start_time]
  2021-03-26  3:01   ` Navin P
@ 2021-04-01 14:50     ` Navin P
  2021-04-03  3:21       ` pid start time and new display field in proc pid stat [ race in task->start_boottime , start_time] Valdis Klētnieks
  0 siblings, 1 reply; 6+ messages in thread
From: Navin P @ 2021-04-01 14:50 UTC (permalink / raw)
  To: kernelnewbies

On Fri, Mar 26, 2021 at 8:31 AM Navin P <navinp0304@gmail.com> wrote:
>
> On Thu, Mar 25, 2021 at 8:51 PM <jim.cromie@gmail.com> wrote:
> >
> >
> >
> > On Thu, Mar 25, 2021 at 6:53 AM Navin P <navinp0304@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >>  As of 5.11 kernel (pid,pid_start_time) is not unique /monotonic even
> >> though the underlying counters are .
> >>  I chose start_boottime because i wanted the counter to increase
> >> during suspend as well.
> >>
> My patch doesn't create a new field, it uses the existing field and
> prints it in /proc/[pid]/stat. Stat files are created
> on first lookup. These are used by ps or top.
>
> + seq_put_decimal_ull(m, " ", task->start_boottime);

After printing the task->start_boottime as 53rd field, i found
multiple process containing same value due to a race. The race exists
for p->start_time as well.
Working on the above  i found a race in the kernel/fork.c at
p->start_boottime = ktime_get_boottime_ns();  i logged the output and
compared them to find identical values for many process.

I made a patch and this fixes the issue but i'm not happy with the
change because it locks the tasklist_lock which is a big lock.

What are the other ways  that can be used to fix this problem ?
I was thinking of changing start_boottime as struct start_boottime {
u64 boottime, rw_lock };
Now all my access to boottime go through the rw_lock . Another rwlock
has to be created for start_time as well.


From a048ac81b468ec94cb10ca41416cfe9641be3c5b Mon Sep 17 00:00:00 2001
From: Navin P <navinp0304@gmail.com>
Date: Thu, 1 Apr 2021 20:04:24 +0530
Subject: [PATCH] Wrap task->start_boottime in write_lock_irq during creation
 and filling leader->task_boottime in fs/exec.c . Also wrap read_lock in
 /proc/fs/array.c for reading task->start_boottime to remove the race.

Signed-off-by: Navin P <navinp0304@gmail.com>
---
 fs/exec.c       | 2 ++
 fs/proc/array.c | 8 ++++++--
 kernel/fork.c   | 6 +++++-
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..6e29698bdd90 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1109,7 +1109,9 @@ static int de_thread(struct task_struct *tsk)
  * also take its birthdate (always earlier than our own).
  */
  tsk->start_time = leader->start_time;
+ write_lock_irq(&tasklist_lock);
  tsk->start_boottime = leader->start_boottime;
+ write_unlock_irq(&tasklist_lock);

  BUG_ON(!same_thread_group(leader, tsk));
  /*
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 74389aaefa9c..dae7e973b9de 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -469,7 +469,7 @@ static int do_task_stat(struct seq_file *m, struct
pid_namespace *ns,
  int num_threads = 0;
  int permitted;
  struct mm_struct *mm;
- unsigned long long start_time;
+ unsigned long long start_time,curr_boottime;
  unsigned long cmin_flt = 0, cmaj_flt = 0;
  unsigned long  min_flt = 0,  maj_flt = 0;
  u64 cutime, cstime, utime, stime;
@@ -563,8 +563,12 @@ static int do_task_stat(struct seq_file *m,
struct pid_namespace *ns,
  nice = task_nice(task);

  /* apply timens offset for boottime and convert nsec -> ticks */
+
+        read_lock(&tasklist_lock);
  start_time =
  nsec_to_clock_t(timens_add_boottime_ns(task->start_boottime));
+ curr_boottime = task->start_boottime;
+ read_unlock(&tasklist_lock);

  seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns));
  seq_puts(m, " (");
@@ -645,7 +649,7 @@ static int do_task_stat(struct seq_file *m, struct
pid_namespace *ns,
  else
  seq_puts(m, " 0");

- seq_put_decimal_ull(m, " ", task->start_boottime);
+ seq_put_decimal_ull(m, " ", curr_boottime);
  seq_putc(m, '\n');
  if (mm)
  mmput(mm);
diff --git a/kernel/fork.c b/kernel/fork.c
index 426cd0c51f9e..41728b9bb76d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2219,11 +2219,15 @@ static __latent_entropy struct task_struct
*copy_process(
  * communication until we take the tasklist-lock. In particular, we do
  * not want user-space to be able to predict the process start-time by
  * stalling fork(2) after we recorded the start_time but before it is
- * visible to the system.
+ * visible to the system. Quickly take write lock and fill in details
+         * preventing race.
  */

  p->start_time = ktime_get_ns();
+        write_lock_irq(&tasklist_lock);
  p->start_boottime = ktime_get_boottime_ns();
+ write_unlock_irq(&tasklist_lock);
+

  /*
  * Make it visible to the rest of the system, but dont wake it up yet.
-- 
2.25.1

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: pid start time and new display field in proc pid stat [ race in task->start_boottime , start_time]
  2021-04-01 14:50     ` pid start time and new display field in proc pid stat [ race in task->start_boottime ,start_time] Navin P
@ 2021-04-03  3:21       ` Valdis Klētnieks
  2021-04-03  8:47         ` pid start time and new display field in proc pid stat [ race in task->start_boottime ,start_time] Navin P
  0 siblings, 1 reply; 6+ messages in thread
From: Valdis Klētnieks @ 2021-04-03  3:21 UTC (permalink / raw)
  To: Navin P; +Cc: kernelnewbies

[-- Attachment #1.1: Type: text/plain, Size: 1037 bytes --]

On Thu, 01 Apr 2021 20:20:04 +0530, Navin P said:

> After printing the task->start_boottime as 53rd field, i found
> multiple process containing same value due to a race. The race exists
> for p->start_time as well.

Well, yeah.  There's no existing reason that the kernel shouldn't be able to
spawn processes on two CPUs in parallel (and initialize the start_time field to
the same value along the way).  There should be as little synchronizing locking
as possible, and if you need to do something that requires locking, you should
probably piggyback it on top of a code section that already has the appropriate
locks in place.

> What are the other ways  that can be used to fix this problem ?

Is there a particular reason why your code *needs* to have different unique
values for the start_time and start_boottime fields?  If your code needs
something unique per-process, the tuple (start_time, PID) *will* be unique
system-wide.  Or just PID.

In other words, what problem are you trying to solve by making the start_time
unique?


[-- Attachment #1.2: Type: application/pgp-signature, Size: 832 bytes --]

[-- Attachment #2: Type: text/plain, Size: 170 bytes --]

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: pid start time and new display field in proc pid stat [ race in task->start_boottime ,start_time]
  2021-04-03  3:21       ` pid start time and new display field in proc pid stat [ race in task->start_boottime , start_time] Valdis Klētnieks
@ 2021-04-03  8:47         ` Navin P
  0 siblings, 0 replies; 6+ messages in thread
From: Navin P @ 2021-04-03  8:47 UTC (permalink / raw)
  To: Valdis Klētnieks; +Cc: kernelnewbies

On Sat, Apr 3, 2021 at 8:51 AM Valdis Klētnieks <valdis.kletnieks@vt.edu> wrote:
>
> On Thu, 01 Apr 2021 20:20:04 +0530, Navin P said:
>
> > After printing the task->start_boottime as 53rd field, i found
> > multiple process containing same value due to a race. The race exists
> > for p->start_time as well.
>
> Well, yeah.  There's no existing reason that the kernel shouldn't be able to
> spawn processes on two CPUs in parallel (and initialize the start_time field to
> the same value along the way).  There should be as little synchronizing locking
> as possible, and if you need to do something that requires locking, you should
> probably piggyback it on top of a code section that already has the appropriate
> locks in place.
>
> > What are the other ways  that can be used to fix this problem ?
>
> Is there a particular reason why your code *needs* to have different unique
> values for the start_time and start_boottime fields?  If your code needs
> something unique per-process, the tuple (start_time, PID) *will* be unique
> system-wide.  Or just PID.
>
> In other words, what problem are you trying to solve by making the start_time
> unique?
>
Previously i was of the assumption that (start_time ,pid) is unique
but since pid can be recycled in certain scenarios  . When there is
shortage of free pids, existing pids can be recycled.

https://bugs.chromium.org/p/project-zero/issues/detail?id=1692

The field 22 in /proc/<pid>/stat is divided by CONFIG_HZ(250 or 1000
...), so we are losing bits. Even the counter task->start_boottime is
not unique probably for the reasons you specified and
ktime_get_boottime_ns() being not race free .  Idea is to log
(ns,pid,start_time) as key into sqlite or db for live monitoring or
collection of metrics.

task->start_boottime is not unique . Hence i had two choices, one is
create an atomic counter in struct_task and use that and other is to
lock accesses to task->start_boottime . The locking approach needs
lock on tasklist_lock which seems to be a little heavy  compared to
new atomic in task_struct . The atomic in task_struct also has a
hidden advantage that it will show you the cumulative number of
forks() done since the system booted. 53rd field of cat
/prof/self/stat  && sleep 5 && cat /proc/self/stat could catch a fork
run by daemon monitoring process.

If the locking approach needs to be taken, then apart from boottime
start_time also neds lock on tasklist_lock.

I've posted the previous locking based one. The below is for a new field.

From a19aadff6b615134379f978cf7afa2c6d99e88cd Mon Sep 17 00:00:00 2001
From: Navin P <navinp0304@gmail.com>
Date: Sat, 3 Apr 2021 13:31:36 +0530
Subject: [PATCH] Create a new atomic64 field namely task->tpid_counter in
 task_struct. Create a static atomic64 pid_counter in fork.c which will be
 incremented once for every successful call of copy_process.

Signed-off-by: Navin P <navinp0304@gmail.com>
---
 fs/proc/array.c       | 2 +-
 include/linux/sched.h | 2 +-
 kernel/fork.c         | 7 +++++++
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 74389aaefa9c..8dc917676845 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -645,7 +645,7 @@ static int do_task_stat(struct seq_file *m, struct
pid_namespace *ns,
  else
  seq_puts(m, " 0");

- seq_put_decimal_ull(m, " ", task->start_boottime);
+ seq_put_decimal_ull(m, " ", atomic64_read(&task->tpid_counter));
  seq_putc(m, '\n');
  if (mm)
  mmput(mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ef00bb22164c..c1f8c19db6e2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1370,7 +1370,7 @@ struct task_struct {
 #ifdef CONFIG_KRETPROBES
  struct llist_head               kretprobe_instances;
 #endif
-
+ atomic64_t                      tpid_counter;
  /*
  * New fields for task_struct should be added above here, so that
  * they are included in the randomized portion of task_struct.
diff --git a/kernel/fork.c b/kernel/fork.c
index 426cd0c51f9e..d2e865776f12 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -139,6 +139,8 @@ DEFINE_PER_CPU(unsigned long, process_counts) = 0;

 __cacheline_aligned DEFINE_RWLOCK(tasklist_lock);  /* outer */

+static atomic64_t pid_counter;
+
 #ifdef CONFIG_PROVE_RCU
 int lockdep_tasklist_lock_is_held(void)
 {
@@ -2225,6 +2227,11 @@ static __latent_entropy struct task_struct *copy_process(
  p->start_time = ktime_get_ns();
  p->start_boottime = ktime_get_boottime_ns();

+        /* pid_counter is the static global variable that increments each time
+         * and the previous value of pid_counter is returned.
+         */
+ atomic64_set(&p->tpid_counter,atomic64_fetch_add(1,&pid_counter));
+
  /*
  * Make it visible to the rest of the system, but dont wake it up yet.
  * Need tasklist lock for parent etc handling!
-- 
2.25.1

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, back to index

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-25 12:53 pid start time and new display field in proc pid stat Navin P
2021-03-25 15:20 ` jim.cromie
2021-03-26  3:01   ` Navin P
2021-04-01 14:50     ` pid start time and new display field in proc pid stat [ race in task->start_boottime ,start_time] Navin P
2021-04-03  3:21       ` pid start time and new display field in proc pid stat [ race in task->start_boottime , start_time] Valdis Klētnieks
2021-04-03  8:47         ` pid start time and new display field in proc pid stat [ race in task->start_boottime ,start_time] Navin P

Kernel Newbies Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/kernelnewbies/0 kernelnewbies/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 kernelnewbies kernelnewbies/ https://lore.kernel.org/kernelnewbies \
		kernelnewbies@kernelnewbies.org
	public-inbox-index kernelnewbies

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernelnewbies.kernelnewbies


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git