Hi Oleg, On Thu, Sep 10, 2015 at 07:55:57PM +0200, Oleg Nesterov wrote: > On 09/10, Boqun Feng wrote: > > > > On Wed, Sep 09, 2015 at 12:28:22PM -0700, Paul E. McKenney wrote: > > > My feeling is > > > that we should avoid saying too much about the internals of wait_event() > > > and wake_up(). > > I feel the same. I simply can't understand what we are trying to > document ;) > What I think we should document here is what memory ordering guarantee we can rely on with these sleep/wakeup primitives, or what kind of barriers these primitives imply. Because the structure of the memory-barriers.txt here is: (*) Implicit kernel memory barriers. - Locking functions. - Interrupt disabling functions. ->- Sleep and wake-up functions.<- - Miscellaneous functions. > For example, > > > A STORE-LOAD barrier is implied after setting task state by wait-related functions: > > > > prepare_to_wait(); > > prepare_to_wait_exclusive(); > > prepare_to_wait_event(); > > I won't argue, but to me this looks misleading too. > > Yes, prepare_to_wait()->set_current_state() implies mb() and thus > a STORE-LOAD barrier. > > But this has nothing to do with the explanation above. We do not > need this barrier to avoid the race with wake_up(). Again, again, > we can safely rely on wq->lock and acquire/release semantics. > Yes, you are right. prepare_to_wait*() should be put here. What should be put here is set_current_state(), whose STORE-LOAD barrier pairs with the STORE-LOAD barrier of wake_up_process(). > This barrier is only needed if you do, say, > > CONDITION = 1; > > if (waitqueue_active(wq)) > wake_up(wq); > > And note that the code above is wrong without another mb() after > CONDITION = 1. > Understood, I admit I didn't realize this before. To summarize, we have three kinds of data related to sleep/wakeup: * CONDITIONs: global data used to indicate events * task states * wait queues(may not be used, if users use set_current_state() + schedule() to sleep and wake_up_process() to wake up) IIUC, the race on wait queues are almost avoided because of wq->locks, and if a wait queue is used, race on task states are avoided because states are readed and written with a wq->lock held in sleep/wakeup functions. So only in two cases we need STORE-LOAD barriers to avoid the race: 1. no wait queue used(e.g. rcu_boost_kthread), we need STORE-LOAD to order accesses to task states and CONDITIONs, so we have barriers in wake_up_process() and set_current_state(). 2. wait queue accessed without a wq->lock held(e.g. your example), we need STORE-LOAD to order accesses to wait queues and CONDITIONs Since case #1 still exists in kernel, we'd better keep this section in memory-barriers.txt, however, I'm not sure whether we should mention case #2 in this section. Here is a modified version, without mentioning case #2: SLEEP AND WAKE-UP FUNCTIONS --------------------------- Sleeping and waking on an event flagged in global data can be viewed as an interaction between two pieces of data: the task state of the task waiting for the event and the global data used to indicate the event. To make sure that these appear to happen in the right order, the primitives to begin the process of going to sleep, and the primitives to initiate a wake up imply certain barriers. If a wait queue is used, all accesses to task states are protected by the lock of the wait queue, so the race on task states are avoided. However, if no wait queue used, we need some memory ordering guantanee to avoid the race between sleepers/wakees and wakers. The memory ordering requirement here can be expressed by two STORE-LOAD barriers(barriers which can guarantee a STORE perceding it can never be reordered after a LOAD following it). One STORE-LOAD barrier is needed on the sleeper/wakee side, before reading a variable used to indicate the event and after setting the state of the current task. Another STORE-LOAD barrier is needed on the waker side, before reading the state of the wakee task and after setting a variable used to indicate the event. These two barriers can pair with each other to avoid race conditions between sleepers/wakees and wakers: sleepr/wakee on CPU 1 waker on CPU 2 ======================== ======================== { wakee->state = TASK_RUNNING, event_indicated = 0 } STORE current->state=TASK_INTERRUPTIBLE c = LOAD event_indicated STORE event_indicated=1 s = LOAD wakee->state assert(!(c==0 && s == TASK_RUNNING)); A STORE-LOAD barrier is implied after setting task state in set_current_state() and before reading task state in wake_up_process() Make sure call set_current_state() before read the global data used to indicate event and sleep, and call wake_up_process() after set the global data used to indicate a event. Regards, Boqun