Hi,

This optimization is broken. The main concern here: Is it possible that
lru_add_drain_all() _would_ have drained pagevec X, but then aborted
because another lru_add_drain_all() is underway and that other task will
_not_ drain pagevec X? I claim the answer is yes!

My suggested changes are inline below.

I attached a litmus test to verify it.

On 2020-05-22, Peter Zijlstra wrote:
> On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote:
>> @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>>   */
>>  void lru_add_drain_all(void)
>>  {
>
>> +	static unsigned int lru_drain_gen;
>>  	static struct cpumask has_work;
>> +	static DEFINE_MUTEX(lock);
>> +	int cpu, this_gen;
>>
>>  	/*
>>  	 * Make sure nobody triggers this path before mm_percpu_wq is fully
>> @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
>>  	if (WARN_ON(!mm_percpu_wq))
>>  		return;
>>

An smp_mb() is needed here.

	/*
	 * Guarantee the pagevec counter stores visible by
	 * this CPU are visible to other CPUs before loading
	 * the current drain generation.
	 */
	smp_mb();

>> +	this_gen = READ_ONCE(lru_drain_gen);
>> +	smp_rmb();
>
> 	this_gen = smp_load_acquire(&lru_drain_gen);
>>
>>  	mutex_lock(&lock);
>>
>>  	/*
>> +	 * (C) Exit the draining operation if a newer generation, from another
>> +	 * lru_add_drain_all(), was already scheduled for draining. Check (A).
>>  	 */
>> +	if (unlikely(this_gen != lru_drain_gen))
>>  		goto done;
>>
>
>> +	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
>> +	smp_wmb();

Instead of smp_wmb(), this needs to be a full memory barrier.

	/*
	 * Guarantee the new drain generation is stored before
	 * loading the pagevec counters.
	 */
	smp_mb();

> You can leave this smp_wmb() out and rely on the smp_mb() implied by
> queue_work_on()'s test_and_set_bit().
>
>>  	cpumask_clear(&has_work);
>> -
>>  	for_each_online_cpu(cpu) {
>>  		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>>
>
> While you're here, do:
>
> 	s/cpumask_set_cpu/__&/
>
>> @@ -766,7 +803,7 @@ void lru_add_drain_all(void)
>>  {
>>  	lru_add_drain();
>>  }
>> -#endif
>> +#endif /* CONFIG_SMP */
>>
>>  /**
>>   * release_pages - batched put_page()

For the litmus test:

1:rx=0                (P1 did not see the pagevec counter)
2:rx=1                (P2 _would_ have seen the pagevec counter)
2:ry1=0 /\ 2:ry2=1    (P2 aborted due to optimization)

Changing the smp_mb() back to smp_wmb() in P1 and removing the smp_mb()
in P2 represents this patch. It shows that sometimes P2 will abort even
though it would have drained the pagevec and P1 did not drain the
pagevec.

This is ugly as hell. There may be other memory barrier types that
would make it prettier. But as is, memory barriers are missing.

John Ogness
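
For readers without the attachment, here is a minimal sketch of a litmus
test that matches the outcome listed above, written for herd7 with the
Linux-kernel memory model. It is a reconstruction from the text, not the
attached file: the test name, the mapping of x to a pagevec counter and
y to lru_drain_gen, and the role of P0 as the task that adds a page are
all assumptions, and the mutex is deliberately not modeled.

```litmus
C lru_add_drain_all

(*
 * Hypothetical reconstruction, not the original attachment.
 *
 * x: a pagevec counter (assumed mapping)
 * y: lru_drain_gen (assumed mapping)
 *
 * P0 adds a page to a pagevec.
 * P1 is the drainer: it bumps the drain generation, then checks the
 *    pagevec counter.
 * P2 is a later caller: it can see the pagevec store, samples the
 *    generation (ry1), then re-reads it at the point where the mutex
 *    would be held (ry2). ry1 != ry2 means P2 takes the early exit.
 * The mutex itself is not modeled.
 *)

{}

P0(int *x)
{
	WRITE_ONCE(*x, 1);
}

P1(int *x, int *y)
{
	int rx;

	WRITE_ONCE(*y, 1);
	smp_mb();
	rx = READ_ONCE(*x);
}

P2(int *x, int *y)
{
	int rx;
	int ry1;
	int ry2;

	rx = READ_ONCE(*x);
	smp_mb();
	ry1 = READ_ONCE(*y);
	smp_rmb();
	ry2 = READ_ONCE(*y);
}

exists (1:rx=0 /\ 2:rx=1 /\ 2:ry1=0 /\ 2:ry2=1)
```

Assuming the sketch is saved as lru_add_drain_all.litmus (a hypothetical
file name) inside tools/memory-model, it can be checked with
"herd7 -conf linux-kernel.cfg lru_add_drain_all.litmus". The accesses
form the classic RWC shape, so the exists clause is expected to be
unreachable with both smp_mb() calls in place, and reachable once P1
uses smp_wmb() and P2 drops its smp_mb().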