Re: [PATCH] timerfd_create.2: Included return value 0 - Michael Kerrisk (man-pages)

From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: Thomas Gleixner <tglx@linutronix.de>, "devi R.K" <devi.feb27@gmail.com>
Cc: mtk.manpages@gmail.com, linux-man@vger.kernel.org,
	lkml <linux-kernel@vger.kernel.org>,
	arul.jeniston@gmail.com
Subject: Re: [PATCH] timerfd_create.2: Included return value 0
Date: Mon, 30 Mar 2020 23:09:24 +0200	[thread overview]
Message-ID: <d79b3520-77ac-199e-6576-ab9f235ac297@gmail.com> (raw)
In-Reply-To: <87a73ywzbv.fsf@nanos.tec.linutronix.de>

Hello Thomas,

On 3/30/20 12:50 AM, Thomas Gleixner wrote:
> Micheal,
> 
> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
>> [Greetings, Thomas; now I recall a conversation we had in Lyon :-) ]
> 
> Hehe.
> 
>> I think this patch does not really capture the details
>> properly. The immediately preceding paragraph says:
>>
>>          If  the  associated  clock  is  either  CLOCK_REALTIME   or
>>          CLOCK_REALTIME_ALARM,     the     timer     is     absolute
>>          (TFD_TIMER_ABSTIME), and the  flag  TFD_TIMER_CANCEL_ON_SET
>>          was  specified when calling timerfd_settime(), then read(2)
>>          fails with the  error  ECANCELED  if  the  real-time  clock
>>          undergoes a discontinuous change.  (This allows the reading
>>          application to discover such discontinuous changes  to  the
>>          clock.)
>>
>> Following on from that, I think we should have a pargraph that says
>> something like:
>>
>>          If  the  associated  clock  is  either  CLOCK_REALTIME   or
>>          CLOCK_REALTIME_ALARM,     the     timer     is     absolute
>>          (TFD_TIMER_ABSTIME), and the  flag  TFD_TIMER_CANCEL_ON_SET
>>          was not specified when calling timerfd_settime(), then a
>>          discontinuous negative change to the clock 
>>          (e.g., clock_settime(2)) may cause read(2) to unblock, but
>>          return a value of 0 (i.e., no bytes read), if the clock
>>          change occurs after the time expired, but before the
>>          read(2) on the timerfd file descriptor.
> 
> Yes, that's correct. Accurate as always!

Thanks. (It took me a while to nut it out, actually.)

> This is pretty much in line with clock_nanosleep(CLOCK_REALTIME,
> TIMER_ABSTIME) which has a similar problem vs. observability in user
> space.
> 
> clock_nanosleep(2) mutters:
> 
>   "POSIX.1 specifies that after changing the value of the CLOCK_REALTIME
>    clock via clock_settime(2), the new clock value shall be used to
>    determine the time at which a thread blocked on an absolute
>    clock_nanosleep() will wake up; if the new clock value falls past the
>    end of the sleep interval, then the clock_nanosleep() call will return
>    immediately."
> 
> which can be interpreted as guarantee that clock_nanosleep() never
> returns prematurely, 

<nod>

> i.e. the assert() in the below code would indicate
> a kernel failure:
> 
>    ret = clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &expiry, NULL);
>    if (!ret) {
>          clock_gettime(CLOCK_REALTIME, &now);
>          assert(now >= expiry);
>    }
> 
> But that assert can trigger when CLOCK_REALTIME was modified after the
> timer fired and the kernel decided to wake up the task and let it return
> to user space.
> 
>    clock_nanosleep(..., &expiry)
>      arm_timer(expires);
>      schedule();
> 
>    -> timer interrupt
>       now = ktime_get_real();
>       if (expires <= now)
>               -------------------------------- After this point
>          wakeup();                             clock_settime(2) or
>                                                adjtimex(2) which
>                                                makes CLOCK_REALTIME
>                                                jump back far enough will
>                                                cause the above assert
>                                                to trigger.
> 
>    ...
>    return from syscall (retval == 0)
> 
> There is no guarantee against clock_settime() coming after the
> wakeup. Even if we put another check into the return to user path then
> we won't catch a clock_settime() which comes right after that and before
> user space invokes clock_gettime().

<nod>

> POSIX spec Issue 7 (2018 edition) says:
> 
>  The suspension for the absolute clock_nanosleep() function (that is,
>  with the TIMER_ABSTIME flag set) shall be in effect at least until the
>  value of the corresponding clock reaches the absolute time specified by
>  rqtp.
> 
> And that's what the kernel implements for clock_nanosleep() and timerfd
> behaves exactly the same way.
> 
> The wakeup of the waiter, i.e. task blocked in clock_nanosleep(2),
> read(2), poll(2), is not happening _before_ the absolute time specified
> is reached.
> 
> If clock_settime() happens right before the expiry check, then it does
> the right thing, but any modification to the clock after the wakeup
> cannot be mitigated. At least not in a way which would make the assert()
> in the example code above a reliable indicator for a kernel fail.
> 
> That's the reason why I rejected the attempt to mitigate that particular
> 0 tick issue in timerfd as it would just scratch a particular itch but
> still not provide any guarantee. So having the '0' return documented is
> the right way to go.

Thanks for the incredibly detailed follow-up Thomas (which all 
landed in my commit message). I've applied a patch almost exactly
the same as the text I showed above (and it's pushed to Git).

All of 2020 is a bust, I expect. Perhaps we see us at a conference
in 2021 ;-).

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/