From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: Thomas Gleixner <tglx@linutronix.de>, "devi R.K" <devi.feb27@gmail.com>
Cc: mtk.manpages@gmail.com, linux-man@vger.kernel.org,
lkml <linux-kernel@vger.kernel.org>,
arul.jeniston@gmail.com
Subject: Re: [PATCH] timerfd_create.2: Included return value 0
Date: Mon, 30 Mar 2020 23:09:24 +0200 [thread overview]
Message-ID: <d79b3520-77ac-199e-6576-ab9f235ac297@gmail.com> (raw)
In-Reply-To: <87a73ywzbv.fsf@nanos.tec.linutronix.de>
Hello Thomas,
On 3/30/20 12:50 AM, Thomas Gleixner wrote:
> Micheal,
>
> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
>> [Greetings, Thomas; now I recall a conversation we had in Lyon :-) ]
>
> Hehe.
>
>> I think this patch does not really capture the details
>> properly. The immediately preceding paragraph says:
>>
>> If the associated clock is either CLOCK_REALTIME or
>> CLOCK_REALTIME_ALARM, the timer is absolute
>> (TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET
>> was specified when calling timerfd_settime(), then read(2)
>> fails with the error ECANCELED if the real-time clock
>> undergoes a discontinuous change. (This allows the reading
>> application to discover such discontinuous changes to the
>> clock.)
>>
>> Following on from that, I think we should have a pargraph that says
>> something like:
>>
>> If the associated clock is either CLOCK_REALTIME or
>> CLOCK_REALTIME_ALARM, the timer is absolute
>> (TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET
>> was not specified when calling timerfd_settime(), then a
>> discontinuous negative change to the clock
>> (e.g., clock_settime(2)) may cause read(2) to unblock, but
>> return a value of 0 (i.e., no bytes read), if the clock
>> change occurs after the time expired, but before the
>> read(2) on the timerfd file descriptor.
>
> Yes, that's correct. Accurate as always!
Thanks. (It took me a while to nut it out, actually.)
> This is pretty much in line with clock_nanosleep(CLOCK_REALTIME,
> TIMER_ABSTIME) which has a similar problem vs. observability in user
> space.
>
> clock_nanosleep(2) mutters:
>
> "POSIX.1 specifies that after changing the value of the CLOCK_REALTIME
> clock via clock_settime(2), the new clock value shall be used to
> determine the time at which a thread blocked on an absolute
> clock_nanosleep() will wake up; if the new clock value falls past the
> end of the sleep interval, then the clock_nanosleep() call will return
> immediately."
>
> which can be interpreted as guarantee that clock_nanosleep() never
> returns prematurely,
<nod>
> i.e. the assert() in the below code would indicate
> a kernel failure:
>
> ret = clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &expiry, NULL);
> if (!ret) {
> clock_gettime(CLOCK_REALTIME, &now);
> assert(now >= expiry);
> }
>
> But that assert can trigger when CLOCK_REALTIME was modified after the
> timer fired and the kernel decided to wake up the task and let it return
> to user space.
>
> clock_nanosleep(..., &expiry)
> arm_timer(expires);
> schedule();
>
> -> timer interrupt
> now = ktime_get_real();
> if (expires <= now)
> -------------------------------- After this point
> wakeup(); clock_settime(2) or
> adjtimex(2) which
> makes CLOCK_REALTIME
> jump back far enough will
> cause the above assert
> to trigger.
>
> ...
> return from syscall (retval == 0)
>
> There is no guarantee against clock_settime() coming after the
> wakeup. Even if we put another check into the return to user path then
> we won't catch a clock_settime() which comes right after that and before
> user space invokes clock_gettime().
<nod>
> POSIX spec Issue 7 (2018 edition) says:
>
> The suspension for the absolute clock_nanosleep() function (that is,
> with the TIMER_ABSTIME flag set) shall be in effect at least until the
> value of the corresponding clock reaches the absolute time specified by
> rqtp.
>
> And that's what the kernel implements for clock_nanosleep() and timerfd
> behaves exactly the same way.
>
> The wakeup of the waiter, i.e. task blocked in clock_nanosleep(2),
> read(2), poll(2), is not happening _before_ the absolute time specified
> is reached.
>
> If clock_settime() happens right before the expiry check, then it does
> the right thing, but any modification to the clock after the wakeup
> cannot be mitigated. At least not in a way which would make the assert()
> in the example code above a reliable indicator for a kernel fail.
>
> That's the reason why I rejected the attempt to mitigate that particular
> 0 tick issue in timerfd as it would just scratch a particular itch but
> still not provide any guarantee. So having the '0' return documented is
> the right way to go.
Thanks for the incredibly detailed follow-up Thomas (which all
landed in my commit message). I've applied a patch almost exactly
the same as the text I showed above (and it's pushed to Git).
All of 2020 is a bust, I expect. Perhaps we see us at a conference
in 2021 ;-).
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
next prev parent reply other threads:[~2020-03-30 21:09 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <CAJymdbxJNag1W0vR9Ysm7_y91HwLAu4QaSMKZbed4emT1DYvww@mail.gmail.com>
2020-03-26 8:45 ` [PATCH] timerfd_create.2: Included return value 0 Michael Kerrisk (man-pages)
[not found] ` <CAJymdbwfm7EypQfXRqWZHbfj2SKk7PCP7SMccz6bmJWSauDqPQ@mail.gmail.com>
[not found] ` <CAJymdby8pDASq5Xv7DeTJ5cq1NXPe1jUWwApxZ5o-huaEXUrjw@mail.gmail.com>
2020-03-29 21:06 ` Michael Kerrisk (man-pages)
2020-03-29 22:50 ` Thomas Gleixner
2020-03-30 21:09 ` Michael Kerrisk (man-pages) [this message]
2020-03-30 21:29 ` Thomas Gleixner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d79b3520-77ac-199e-6576-ab9f235ac297@gmail.com \
--to=mtk.manpages@gmail.com \
--cc=arul.jeniston@gmail.com \
--cc=devi.feb27@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-man@vger.kernel.org \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).