Linux-man Archive on
 help / color / Atom feed
From: Thomas Gleixner <>
To: "Michael Kerrisk \(man-pages\)" <>,
	"devi R.K" <>
	lkml <>,
Subject: Re: [PATCH] timerfd_create.2: Included return value 0
Date: Mon, 30 Mar 2020 00:50:12 +0200
Message-ID: <> (raw)
In-Reply-To: <>


"Michael Kerrisk (man-pages)" <> writes:
> [Greetings, Thomas; now I recall a conversation we had in Lyon :-) ]


> I think this patch does not really capture the details
> properly. The immediately preceding paragraph says:
>          If  the  associated  clock  is  either  CLOCK_REALTIME   or
>          CLOCK_REALTIME_ALARM,     the     timer     is     absolute
>          (TFD_TIMER_ABSTIME), and the  flag  TFD_TIMER_CANCEL_ON_SET
>          was  specified when calling timerfd_settime(), then read(2)
>          fails with the  error  ECANCELED  if  the  real-time  clock
>          undergoes a discontinuous change.  (This allows the reading
>          application to discover such discontinuous changes  to  the
>          clock.)
> Following on from that, I think we should have a pargraph that says
> something like:
>          If  the  associated  clock  is  either  CLOCK_REALTIME   or
>          CLOCK_REALTIME_ALARM,     the     timer     is     absolute
>          (TFD_TIMER_ABSTIME), and the  flag  TFD_TIMER_CANCEL_ON_SET
>          was not specified when calling timerfd_settime(), then a
>          discontinuous negative change to the clock 
>          (e.g., clock_settime(2)) may cause read(2) to unblock, but
>          return a value of 0 (i.e., no bytes read), if the clock
>          change occurs after the time expired, but before the
>          read(2) on the timerfd file descriptor.

Yes, that's correct. Accurate as always!

This is pretty much in line with clock_nanosleep(CLOCK_REALTIME,
TIMER_ABSTIME) which has a similar problem vs. observability in user

clock_nanosleep(2) mutters:

  "POSIX.1 specifies that after changing the value of the CLOCK_REALTIME
   clock via clock_settime(2), the new clock value shall be used to
   determine the time at which a thread blocked on an absolute
   clock_nanosleep() will wake up; if the new clock value falls past the
   end of the sleep interval, then the clock_nanosleep() call will return

which can be interpreted as guarantee that clock_nanosleep() never
returns prematurely, i.e. the assert() in the below code would indicate
a kernel failure:

   ret = clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &expiry, NULL);
   if (!ret) {
         clock_gettime(CLOCK_REALTIME, &now);
         assert(now >= expiry);

But that assert can trigger when CLOCK_REALTIME was modified after the
timer fired and the kernel decided to wake up the task and let it return
to user space.

   clock_nanosleep(..., &expiry)

   -> timer interrupt
      now = ktime_get_real();
      if (expires <= now)
              -------------------------------- After this point
         wakeup();                             clock_settime(2) or
                                               adjtimex(2) which
                                               makes CLOCK_REALTIME
                                               jump back far enough will
                                               cause the above assert
                                               to trigger.

   return from syscall (retval == 0)

There is no guarantee against clock_settime() coming after the
wakeup. Even if we put another check into the return to user path then
we won't catch a clock_settime() which comes right after that and before
user space invokes clock_gettime().

POSIX spec Issue 7 (2018 edition) says:

 The suspension for the absolute clock_nanosleep() function (that is,
 with the TIMER_ABSTIME flag set) shall be in effect at least until the
 value of the corresponding clock reaches the absolute time specified by

And that's what the kernel implements for clock_nanosleep() and timerfd
behaves exactly the same way.

The wakeup of the waiter, i.e. task blocked in clock_nanosleep(2),
read(2), poll(2), is not happening _before_ the absolute time specified
is reached.

If clock_settime() happens right before the expiry check, then it does
the right thing, but any modification to the clock after the wakeup
cannot be mitigated. At least not in a way which would make the assert()
in the example code above a reliable indicator for a kernel fail.

That's the reason why I rejected the attempt to mitigate that particular
0 tick issue in timerfd as it would just scratch a particular itch but
still not provide any guarantee. So having the '0' return documented is
the right way to go.



  reply index

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <>
2020-03-26  8:45 ` Michael Kerrisk (man-pages)
     [not found]   ` <>
     [not found]     ` <>
2020-03-29 21:06       ` Michael Kerrisk (man-pages)
2020-03-29 22:50         ` Thomas Gleixner [this message]
2020-03-30 21:09           ` Michael Kerrisk (man-pages)
2020-03-30 21:29             ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-man Archive on

Archives are clonable:
	git clone --mirror linux-man/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-man linux-man/ \
	public-inbox-index linux-man

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone