timerfd waking up before timer really expires

* timerfd waking up before timer really expires
@ 2015-03-03 19:59 Lucas De Marchi
  2015-03-04  8:06 ` Clemens Ladisch
  0 siblings, 1 reply; 3+ messages in thread
From: Lucas De Marchi @ 2015-03-03 19:59 UTC (permalink / raw)
  To: lkml; +Cc: Thomas Gleixner, Steven Rostedt

Hi,

I was debugging my application and noticed that a timerfd event was being
triggered *before* the timer expires. I'd like to know if this behavior is
expected. More details below.

I reduced the scope of the program to test a single timerfd and measure the
difference in the result of clock_gettime() between two reads. Helper functions
are elsewhere, but the loop is very simple:

    loop_time_fd = setup_timerfd(interval, 0);
    iter = 0;
    old = curr = now_usec();
    do {
            if (iter > iterations)
                    break;

            r = read(loop_time_fd, &events, sizeof(events));
            if (r < 0) {
                    if (errno == EINTR)
                            continue;
                    return log_error_errno(errno, "Error reading timerfd: %m");
            }

            curr = now_usec();

            /* mark as USEC_INFINITY those events we missed, the real wait
             * time is nonetheless in the next position */
            for (; events > 1; events--)
                    elapsed[iter++] = USEC_INFINITY;

            elapsed[iter++] = curr - old;
            old = curr;
    } while (1);

Here we have:
  - now_usec() simply calls clock_gettime() with CLOCK_MONOTONIC as clockid and
    converts to usec.
  - setup_timerfd() creates the timerfd with CLOCK_MONOTONIC and let it
    in blocking mode  (there's another version of the test with poll()
and non-block fd,
    but it shows the same behavior)

First I thought it could be because of the timerslack. I disabled it with
prctl(). I'm also running the process with RT priority so AFAIK there shouldn't
be a problem with timerslack. In order to reduce the overhead for the process I
also tried to isolate it alone on a single cpu by booting with isolcpus and
then setting the affinity accordingly.

For whatever interval I configure and 10000 iterations, what I'm seeing in the
elapsed vector are values like

    interval +- 70usec

Due to the wakeup overhead, I understand the +70usec, but I'm not getting why
there are elapsed values shorter than the configured interval.

Tracing for wakeup events with ftrace[1] I also get the same results... the time
difference between 2 wakeup events for ~30% of events are shorter the
configured interval.

I'm currently running with kernel version 3.18.6. Is there anything I'm
missing?

thanks

-- 
Lucas De Marchi

[1] trace-cmd record -F -e sched:sched_wakeup ./test-timer 1000 10000

^ permalink raw reply	[flat|nested] 3+ messages in thread