On 2011-07-13 21:12, Gilles Chanteperdrix wrote:
> On 07/13/2011 09:04 PM, Jan Kiszka wrote:
>> On 2011-07-13 20:39, Gilles Chanteperdrix wrote:
>>> On 07/12/2011 07:43 PM, Jan Kiszka wrote:
>>>> On 2011-07-12 19:38, Gilles Chanteperdrix wrote:
>>>>> On 07/12/2011 07:34 PM, Jan Kiszka wrote:
>>>>>> On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
>>>>>>> On 07/12/2011 02:57 PM, Jan Kiszka wrote:
>>>>>>>> 		xnlock_put_irqrestore(&nklock, s);
>>>>>>>> 		xnpod_schedule();
>>>>>>>> 	}
>>>>>>>> @@ -1036,6 +1043,7 @@ redo:
>>>>>>>> 	 * to process this signal anyway.
>>>>>>>> 	 */
>>>>>>>> 	if (rthal_current_domain == rthal_root_domain) {
>>>>>>>> +		XENO_BUGON(NUCLEUS, xnthread_test_info(thread, XNATOMIC));
>>>>>>>
>>>>>>> Misleading dead code again, XNATOMIC is cleared not ten lines above.
>>>>>>
>>>>>> Nope, I forgot to remove that line.
>>>>>>
>>>>>>>
>>>>>>>> 		if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
>>>>>>>> 		    || this_task->state != TASK_RUNNING))
>>>>>>>> 			xnpod_fatal
>>>>>>>> @@ -1044,6 +1052,8 @@ redo:
>>>>>>>> 		return -ERESTARTSYS;
>>>>>>>> 	}
>>>>>>>>
>>>>>>>> +	xnthread_clear_info(thread, XNATOMIC);
>>>>>>>
>>>>>>> Why this? I find the xnthread_clear_info(XNATOMIC) already at the
>>>>>>> right place where it currently is.
>>>>>>
>>>>>> Nope. Now we either clear XNATOMIC after successful migration or when
>>>>>> the signal is about to be sent (i.e. in the hook). That way we can test
>>>>>> more reliably (TM) in the gatekeeper whether the thread can be migrated.
>>>>>
>>>>> OK for adding the XNATOMIC test, because it improves the robustness, but
>>>>> why change the way XNATOMIC is set and cleared? Chances of breaking
>>>>> things while changing code in this area are really high...
>>>>
>>>> The current code is (most probably) broken as it does not properly
>>>> synchronize the gatekeeper against a signaled and "runaway" target
>>>> Linux task.
>>>>
>>>> We need an indication of whether a Linux signal will wake up (or already
>>>> has woken up) the to-be-migrated task. That task may have continued over
>>>> its context, potentially on a different CPU. Providing this indication is
>>>> the purpose of changing where XNATOMIC is cleared.
>>>
>>> What about synchronizing with the gatekeeper via a semaphore, as done
>>> in the first patch you sent, but doing it in xnshadow_harden, as soon as
>>> we detect that we are not back from schedule() in primary mode? It seems
>>> it would avoid any further issues, as we would then be guaranteed that
>>> the thread could not switch to TASK_INTERRUPTIBLE again before the
>>> gatekeeper is finished.
>>
>> The problem is that the gatekeeper tests the task state without holding
>> the task's rq lock (which is not available to us without a kernel
>> patch). That cannot work reliably as long as we accept signals. That's
>> why I'm trying to move the state change and test under nklock.
>>
>>>
>>> What worries me is the comment in xnshadow_harden:
>>>
>>>  * gatekeeper sent us to primary mode. Since
>>>  * TASK_UNINTERRUPTIBLE is unavailable to us without wrecking
>>>  * the runqueue's count of uniniterruptible tasks, we just
>>>  * notice the issue and gracefully fail; the caller will have
>>>  * to process this signal anyway.
>>>  */
>>>
>>> Does this mean that we cannot switch to TASK_UNINTERRUPTIBLE at this
>>> point? Or simply that TASK_UNINTERRUPTIBLE is not available for the
>>> business of xnshadow_harden?
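For reference, the nklock-protected check discussed above could look roughly
like the following sketch. This is illustrative only, not the actual Xenomai
gatekeeper code: gk_resume_target() is a placeholder for the real resume path,
and the fragment assumes the nucleus headers are already included.

static void gatekeeper_check_and_migrate(struct xnthread *thread,
                                         struct task_struct *p)
{
        spl_t s;

        xnlock_get_irqsave(&nklock, s);

        /*
         * XNATOMIC is cleared either after a successful migration or by
         * the signal hook. If it is gone, or the task is no longer
         * sleeping in xnshadow_harden(), a Linux signal raced us and the
         * shadow must not be resumed in primary mode.
         */
        if (xnthread_test_info(thread, XNATOMIC) &&
            p->state == TASK_INTERRUPTIBLE)
                gk_resume_target(thread);  /* placeholder for the real path */

        xnlock_put_irqrestore(&nklock, s);
}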
>>>
>> TASK_UNINTERRUPTIBLE is not available without patching the kernel's
>> scheduler, for the reason mentioned in the comment (the scheduler becomes
>> confused and may pick the wrong tasks, IIRC).
>
> Doesn't using down/up in the taskexit event handler risk causing the
> same issue?

Yes, and that means the first patch is incomplete without something like
the second.

>
>>
>> But I would refrain from trying to "improve" the gatekeeper design. I've
>> recently mentioned this to Philippe offlist: for Xenomai 3 with some
>> ipipe v3, we must instead patch schedule() to enable zero-switch domain
>> migration. That means: enter the scheduler, let it suspend current and
>> pick another task, but then simply escalate to the RT domain before doing
>> any context switch. That's much cheaper than the current design and
>> hopefully also less error-prone.
>
> So, do you want me to merge your for-upstream branch?

You may merge up to for-upstream^, i.e. without any gatekeeper fixes. I
strongly suspect that there are still more races in the migration path.
The crashes we face even with all patches applied may be related to a
shadow task being executed under Linux and Xenomai at the same time.

Jan
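For reference, the down/up handshake Gilles suggests could be sketched roughly
as below. This only illustrates the idea under discussion, not code that was
merged; gk_sync and the helper names are assumptions, the real gatekeeper
keeps its own per-CPU state, and Jan's "Yes" above points out that this kind
of uninterruptible wait has problems of its own.

#include <linux/semaphore.h>

static struct semaphore gk_sync;        /* assumed name */

static void gk_sync_init(void)
{
        sema_init(&gk_sync, 0);         /* start "locked" */
}

/*
 * xnshadow_harden() side: we came back from schedule() in secondary
 * mode, i.e. a signal won the race against the gatekeeper. Wait until
 * the gatekeeper has finished looking at this task before it may go
 * back to TASK_INTERRUPTIBLE on the next hardening attempt.
 */
static int harden_signal_path(void)
{
        down(&gk_sync);
        return -ERESTARTSYS;            /* caller handles the pending signal */
}

/*
 * Gatekeeper side: whether or not the migration went through, release
 * the waiter so it cannot re-enter TASK_INTERRUPTIBLE while we are
 * still inspecting it.
 */
static void gatekeeper_finish(void)
{
        up(&gk_sync);
}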