From: Paolo Bonzini <pbonzini@redhat.com>
To: liu ping fan <qemulist@gmail.com>
Cc: Peter Maydell <peter.maydell@linaro.org>,
	Stefan Hajnoczi <stefanha@gmail.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	qemu-devel@nongnu.org, Avi Kivity <avi@redhat.com>,
	Anthony Liguori <anthony@codemonkey.ws>,
	Jan Kiszka <jan.kiszka@siemens.com>
Subject: Re: [Qemu-devel] [PATCH v6 1/8] atomic: introduce atomic operations
Date: Thu, 15 Nov 2012 12:24:09 +0100	[thread overview]
Message-ID: <50A4D0D9.1060801@redhat.com> (raw)
In-Reply-To: <CAJnKYQ=S=y2_-Z=K9M1uV1MjgKO48q8HAJnAOsdU4gNZjym7iA@mail.gmail.com>

On 15/11/2012 08:47, liu ping fan wrote:
>>>> RCU prototype required pointer-sized access, which you cannot make type-
>>> But I think that your RCU prototype should rely on the CPU's atomics,
>>> not gcc's atomics.
>>
>> What's the difference?  gcc's atomic produces the same instructions as
>> hand-written assembly (or should).
>>
> Probably I made a mistake here: in vhost, log =
> __sync_fetch_and_and(from, 0) is used to fetch 64 bits atomically when
> 32-bit qemu runs on 64-bit Linux, right?  But how can two 32-bit reads
> be atomic?

You can use cmpxchg8b.
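
For instance (a rough sketch, not the actual vhost code; the helper names
are made up): on 32-bit x86 a plain 64-bit load is two separate 32-bit
loads, but the 8-byte __sync builtins are implemented with lock cmpxchg8b
on CPUs that have it, so all 8 bytes are read and written as one operation:

    #include <stdint.h>

    /* Atomic 64-bit read: comparing with 0 and "swapping in" 0 never
     * modifies *p, but always returns its current contents. */
    static inline uint64_t read_u64_atomic(volatile uint64_t *p)
    {
        return __sync_val_compare_and_swap(p, 0, 0);
    }

    /* The vhost dirty-log trick you quote: fetch the 64-bit word and
     * clear it in one atomic step, so no concurrently set bit is lost. */
    static inline uint64_t fetch_and_clear_u64(volatile uint64_t *p)
    {
        return __sync_fetch_and_and(p, (uint64_t)0);
    }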

>>> Otherwise, it could be slow (I guess something like spinlock there).
>>
>> Not sure what you mean.  I'm using two things: 1) write barriers; 2)
>> atomic_xchg of a pointer for fast, wait-free enqueuing of call_rcu
>> callbacks.
>
> Got your main idea. Will go through your prototype. Just one more
> question: why is atomic_xchg needed here?  Couldn't the sequence "read
> old pointer, set new pointer" satisfy your requirement?

No, it would be racy.  Say you have this:

     wrong                     right
  -----------------------------------------
     ref(new)                  ref(new)
     old = tail                old = new
     tail = new                xchg(&old, &tail)
     old->next = new           old->next = new
     up(&sem)                  up(&sem)

where sem and tail are global, while old and new are local.  There can
be any number of producers as in the above scheme, and one consumer.  In
the consumer, iteration of the list starts from the second element (the
first element is a dummy):

    dummy = new Node;          /* dummy node: the list is never empty */
    tail = dummy;
    for(;;) {
        down(&sem);            /* a successor of dummy has been counted */
        node = dummy->next;
        unref(dummy);
        visit node;
        dummy = node;          /* node becomes the new dummy */
    }

Then the invariants are:

- the number of elements N in the list (starting from the dummy node and
following ->next) is always >= 1.  Because of this, you never have to
touch head when adding an element.

- the number of excess signals in the semaphore sem is always <= N-1.
Because of this, it doesn't matter that there is an instant in time
where tail is not reachable from head.  The element is really in the
list (as far as the consumer goes) only after the semaphore is signaled.
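
Putting the scheme together, a rough C sketch (illustrative only, not the
RCU prototype itself; the node type, the POSIX semaphore and the use of
__atomic_exchange_n in place of atomic_xchg are made up for the example):

    #include <semaphore.h>
    #include <stdlib.h>

    struct node {
        struct node *next;
        /* payload ... */
    };

    static struct node *tail;   /* shared by all producers */
    static sem_t sem;           /* counts nodes the consumer may visit */

    static void queue_init(void)
    {
        tail = calloc(1, sizeof(*tail));   /* the dummy node, so N >= 1 */
        sem_init(&sem, 0, 0);
    }

    /* Producer: wait-free enqueue.  The atomic exchange makes "claim the
     * tail slot" a single step, so concurrent producers each obtain a
     * distinct predecessor to link from. */
    static void enqueue(struct node *new)
    {
        struct node *old;

        new->next = NULL;
        old = __atomic_exchange_n(&tail, new, __ATOMIC_SEQ_CST);
        old->next = new;      /* the new node becomes reachable here... */
        sem_post(&sem);       /* ...but counts as "in the list" only now */
    }

    /* Single consumer: iteration starts from the second element; the
     * first element is the dummy, discarded once a successor exists. */
    static void consume_forever(void (*visit)(struct node *))
    {
        struct node *dummy = tail;   /* the node created by queue_init() */

        for (;;) {
            struct node *node;

            sem_wait(&sem);          /* dummy->next is now valid */
            node = dummy->next;
            free(dummy);             /* stands in for unref(dummy) */
            visit(node);
            dummy = node;
        }
    }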

In the wrong version, two threads could read the same pointer:

    x = tail                                  |
                         y = tail             |
    tail = new_x                              |
                         tail = new_y         |
    x->next = new_x                           |   old_tail->next = new_x
    up(&sem)                                  |
                         y->next = new_y      |   old_tail->next = new_y
                         up(&sem)             |

No node points to new_x, so you have 2 nodes reachable from the head
(the dummy node, and new_y) with 2 signals on the semaphore,
contradicting the second invariant.

This instead works:

    x = new_x                                 |
                         y = new_y            |
    xchg(&x, &tail)                           |   tail = new_x, x = old_tail
                         xchg(&y, &tail)      |   tail = new_y, y = new_x
    x->next = new_x                           |   old_tail->next = new_x
    up(&sem)                                  |
                         y->next = new_y      |   new_x->next = new_y
                         up(&sem)             |

because you have 3 nodes and 2 signals.

Paolo

