All of lore.kernel.org
 help / color / mirror / Atom feed
* Lock-free dmix considered harmful
@ 2020-05-15 15:46 Maarten Baert
  2020-05-19 13:46 ` Takashi Iwai
  0 siblings, 1 reply; 6+ messages in thread
From: Maarten Baert @ 2020-05-15 15:46 UTC (permalink / raw)
  To: alsa-devel

I'm in the process of adding monitoring support to dmix (i.e. the 
ability for applications to capture the sound being played back through 
the speakers) - expect patches for this soon. However while reading the 
dmix source code I came across something weird. Dmix uses a lock-free 
mixing implementation based on (architecture-dependent) atomic 
operations such as atomic addition and compare-exchange. However at the 
same time dmix also uses a semaphore to prevent multiple processes from 
accessing the shared memory buffers simultaneously. This is redundant 
and makes the lock-free implementation rather pointless.

Out of curiosity I decided to measure the time it takes to execute 
snd_pcm_dmix_sync_area for these three cases:

- A simple locked implementation, which acquires the semaphore and then 
does mixing using the non-concurrent code from pcm_dmix_generic.c.
- The current (redundant) implementation, which acquires the semaphore 
and then does atomic mixing.
- A lock-free implementation, which does not acquire the semaphore and 
does atomic mixing.

The resulting sound is identical in all three cases. (As a sanity check 
I also tested the 4th case, non-concurrent code without locking, which 
as expected produces audio with glitches.)

I tested each case with 1, 4 and 20 simultaneous playback streams 
(multiple instances of 'aplay') on a system with 4 physical CPU cores 
(Intel Core i7-4770). You can see the results here:

https://misc.maartenbaert.be/dmix_perf_locked_atomic.png

The results are quite surprising:

- The locked implementation is 3 to 6 times *faster* than the current 
redundant implementation.
- The lock-free implementation is usually about as fast as the current 
redundant implementation, but has extreme outliers that are up to 3 
times *slower* than the redundant implementation when there are multiple 
threads.

It seems that the overhead of the atomic operations is so high that it 
completely negates the theoretical advantage of being able to mix from 
multiple threads simultaneously. This may be the result of 'false 
sharing' when multiple threads are accessing different words in the same 
cache line (which is almost impossible to avoid when mixing audio).

On a related note, I believe that I have also found a bug in the 
lock-free implementation. The logic which is used to clear the 'sum' 
buffer is as follows:

sample = *src;
old_sample = *sum;
if (ARCH_CMPXCHG(dst, 0, 1) == 0)
     sample -= old_sample;
ARCH_ADD(sum, sample);

This takes advantage of the fact that the hardware driver clears the 
playback buffer after reading it. However it does not account for the 
possibility that 'dst' might be zero for other reasons, such as:

- An application plays back silence.
- An application plays back sound, but then rewinds it.
- Multiple applications play back sound which just happens to sum to 
zero occasionally.

These all result in situations where 'dst' is changed from 0 to 1 by the 
compare-exchange operation, but then changed back to zero later, 
resulting in the classic 'ABA problem'. However this problem is 
currently hidden by the fact that the lock-free implementation is in 
fact still locked.

Because of these reasons I think it would be better to drop the 
lock-free implementation entirely and just use the existing 
non-concurrent architecture-independent implementation from 
pcm_dmix_generic.c. Aside from being faster, it would also eliminate a 
lot of architecture-dependent code and inline assembly. Should I submit 
a patch for this?

Best regards,
Maarten Baert


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Lock-free dmix considered harmful
  2020-05-15 15:46 Lock-free dmix considered harmful Maarten Baert
@ 2020-05-19 13:46 ` Takashi Iwai
  2020-05-19 13:51   ` Jaroslav Kysela
  2020-05-19 16:41   ` Maarten Baert
  0 siblings, 2 replies; 6+ messages in thread
From: Takashi Iwai @ 2020-05-19 13:46 UTC (permalink / raw)
  To: Maarten Baert; +Cc: alsa-devel

On Fri, 15 May 2020 17:46:03 +0200,
Maarten Baert wrote:
> 
> I'm in the process of adding monitoring support to dmix (i.e. the ability for applications to capture the sound being played back through the speakers) - expect patches for this soon. However while reading the dmix source code I came across something weird. Dmix uses a lock-free mixing implementation based on (architecture-dependent) atomic operations such as atomic addition and compare-exchange. However at the same time dmix also uses a semaphore to prevent multiple processes from accessing the shared memory buffers simultaneously. This is redundant and makes the lock-free implementation rather pointless.

Indeed, this was rather a bug and introduced in the commit
267d7c728196286726b47df45736eba18d78822a
    Add support of little-endian on i386/x86_64 dmix

It's a really old bug, since 2007.  The semaphore down/up should
have been controlled with the dynamic flag instead of the
conditional build.

So, practically seen, people have been never bothered by this pthread
mutex :)

> Out of curiosity I decided to measure the time it takes to execute snd_pcm_dmix_sync_area for these three cases:
> 
> - A simple locked implementation, which acquires the semaphore and then does mixing using the non-concurrent code from pcm_dmix_generic.c.
> - The current (redundant) implementation, which acquires the semaphore 
> and then does atomic mixing.
> - A lock-free implementation, which does not acquire the semaphore and does atomic mixing.
> 
> The resulting sound is identical in all three cases. (As a sanity check I also tested the 4th case, non-concurrent code without locking, which as expected produces audio with glitches.)
> 
> I tested each case with 1, 4 and 20 simultaneous playback streams (multiple instances of 'aplay') on a system with 4 physical CPU cores (Intel Core i7-4770). You can see the results here:
> 
> https://misc.maartenbaert.be/dmix_perf_locked_atomic.png
> 
> The results are quite surprising:
> 
> - The locked implementation is 3 to 6 times *faster* than the current redundant implementation.
> - The lock-free implementation is usually about as fast as the current 
> redundant implementation, but has extreme outliers that are up to 3 times *slower* than the redundant implementation when there are multiple threads.
> 
> It seems that the overhead of the atomic operations is so high that it completely negates the theoretical advantage of being able to mix from multiple threads simultaneously. This may be the result of 'false sharing' when multiple threads are accessing different words in the same cache line (which is almost impossible to avoid when mixing audio).

Yes, that's a known drawback of the current lockless implementation.
Rather the surprising (or infamous) point is that the buffer access
costs very much.  I guess the difference became more significant on
the modern CPUs.

Also, such an effect is more visible when the allocated memory has a
non-cached attribute.  I experienced the almost unacceptable delay by
dmix on Atom HDMI playback, for example.

> On a related note, I believe that I have also found a bug in the lock-free implementation. The logic which is used to clear the 'sum' buffer is as follows:
> 
> sample = *src;
> old_sample = *sum;
> if (ARCH_CMPXCHG(dst, 0, 1) == 0)
>     sample -= old_sample;
> ARCH_ADD(sum, sample);
> 
> This takes advantage of the fact that the hardware driver clears the playback buffer after reading it. However it does not account for the possibility that 'dst' might be zero for other reasons, such as:
> 
> - An application plays back silence.
> - An application plays back sound, but then rewinds it.
> - Multiple applications play back sound which just happens to sum to zero occasionally.
> 
> These all result in situations where 'dst' is changed from 0 to 1 by the compare-exchange operation, but then changed back to zero later, resulting in the classic 'ABA problem'. However this problem is currently hidden by the fact that the lock-free implementation is in fact still locked.

Hm, I'm not sure whether whether this really leads to the deadlock,
but I can't exclude such a possibility in some corner cases.

> Because of these reasons I think it would be better to drop the lock-free implementation entirely and just use the existing non-concurrent architecture-independent implementation from pcm_dmix_generic.c. Aside from being faster, it would also eliminate a lot of architecture-dependent code and inline assembly. Should I submit a patch for this?

The advantage of lockless dmix implementation isn't about its CPU usage
but the nature where a stream isn't prevented by other streams, which
assures the very low latency, too.  That is, with the generic dmix, a
stream can be still halted when another stream stalls in the middle,
and there is no way to recover from it.

So, IMO, we can start with a dynamic configuration to choose the
behavior and add a configure option to choose the implementations.
The default behavior should be still an open question, though.


thanks,

Takashi

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Lock-free dmix considered harmful
  2020-05-19 13:46 ` Takashi Iwai
@ 2020-05-19 13:51   ` Jaroslav Kysela
  2020-06-19 17:21     ` Takashi Iwai
  2020-05-19 16:41   ` Maarten Baert
  1 sibling, 1 reply; 6+ messages in thread
From: Jaroslav Kysela @ 2020-05-19 13:51 UTC (permalink / raw)
  To: Takashi Iwai, Maarten Baert; +Cc: alsa-devel

Dne 19. 05. 20 v 15:46 Takashi Iwai napsal(a):
> 
>> Because of these reasons I think it would be better to drop the lock-free implementation entirely and just use the existing non-concurrent architecture-independent implementation from pcm_dmix_generic.c. Aside from being faster, it would also eliminate a lot of architecture-dependent code and inline assembly. Should I submit a patch for this?
> 
> The advantage of lockless dmix implementation isn't about its CPU usage
> but the nature where a stream isn't prevented by other streams, which
> assures the very low latency, too.  That is, with the generic dmix, a
> stream can be still halted when another stream stalls in the middle,
> and there is no way to recover from it.
> 
> So, IMO, we can start with a dynamic configuration to choose the
> behavior and add a configure option to choose the implementations.
> The default behavior should be still an open question, though.

I fully agree here.

					Jaroslav

-- 
Jaroslav Kysela <perex@perex.cz>
Linux Sound Maintainer; ALSA Project; Red Hat, Inc.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Lock-free dmix considered harmful
  2020-05-19 13:46 ` Takashi Iwai
  2020-05-19 13:51   ` Jaroslav Kysela
@ 2020-05-19 16:41   ` Maarten Baert
  1 sibling, 0 replies; 6+ messages in thread
From: Maarten Baert @ 2020-05-19 16:41 UTC (permalink / raw)
  To: alsa-devel

On 19/05/2020 15:46, Takashi Iwai wrote:
> Hm, I'm not sure whether whether this really leads to the deadlock,
> but I can't exclude such a possibility in some corner cases.

It is not a deadlock, rather there exist scenarios where the buffer is 
cleared more than once, and since clearing is done by subtracting the 
old value, this results in incorrect audio.

Consider the scenario where before mixing starts, *dst = 0 and *sum = 
100. Now consider that application A plays back 10 and application B 
plays back 0. The expected result is 10. However this can happen:

A: sample = *src;                                 (A:sample == 10)
A: old_sample = *sum;                             (A:old_sample == 100)

<< here A gets interrupted by B>

B: sample = *src;                                 (B:sample == 0)
B: old_sample = *sum;                             (B:old_sample == 100)
B: if (ARCH_CMPXCHG(dst, 0, 1) == 0)              (*dst == 1)
B:     sample -= old_sample;                      (B:sample == -100)
B: ARCH_ADD(sum, sample);                         (*sum == 0)
B: [copy sum to dst]                              (*dst == 0)

<< here A resumes >>

A: if (ARCH_CMPXCHG(dst, 0, 1) == 0)              (*dst == 1)
A:     sample -= old_sample;                      (A:sample == -90)
A: ARCH_ADD(sum, sample);                         (*sum == -90)
A: [copy sum to dst]                              (*dst == -90)

In the end, *sum and *dst are -90 instead of 10.
> The advantage of lockless dmix implementation isn't about its CPU usage
> but the nature where a stream isn't prevented by other streams, which
> assures the very low latency, too.  That is, with the generic dmix, a
> stream can be still halted when another stream stalls in the middle,
> and there is no way to recover from it.

I agree, however considering that the lock has (unintentionally) been 
enabled for 13 years, and no one noticed until now, it doesn't seem to 
be a problem. I think there is a higher risk that removing the lock will 
expose new bugs due to flaws in the lock-free algorithm that were 
previously hidden by the lock.

> So, IMO, we can start with a dynamic configuration to choose the
> behavior and add a configure option to choose the implementations.
> The default behavior should be still an open question, though.

I can do that too, but monitoring is significantly easier to implement 
in the locked case. Also, I see no way to fix the lock-free algorithm 
without making changes to the shared memory layout (which would break 
compatibility between ALSA versions, which is a problem for anyone who 
uses chroots or for some reason has multiple versions of libasound on 
their system (e.g. Steam includes its own version of libasound). And I 
would rather not implement a lock-free implementation which I know is 
not fully correct.

Best regards,
Maarten Baert


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Lock-free dmix considered harmful
  2020-05-19 13:51   ` Jaroslav Kysela
@ 2020-06-19 17:21     ` Takashi Iwai
  2020-06-20 19:38       ` Maarten Baert
  0 siblings, 1 reply; 6+ messages in thread
From: Takashi Iwai @ 2020-06-19 17:21 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Maarten Baert, alsa-devel

On Tue, 19 May 2020 15:51:56 +0200,
Jaroslav Kysela wrote:
> 
> Dne 19. 05. 20 v 15:46 Takashi Iwai napsal(a):
> >
> >> Because of these reasons I think it would be better to drop the lock-free implementation entirely and just use the existing non-concurrent architecture-independent implementation from pcm_dmix_generic.c. Aside from being faster, it would also eliminate a lot of architecture-dependent code and inline assembly. Should I submit a patch for this?
> >
> > The advantage of lockless dmix implementation isn't about its CPU usage
> > but the nature where a stream isn't prevented by other streams, which
> > assures the very low latency, too.  That is, with the generic dmix, a
> > stream can be still halted when another stream stalls in the middle,
> > and there is no way to recover from it.
> >
> > So, IMO, we can start with a dynamic configuration to choose the
> > behavior and add a configure option to choose the implementations.
> > The default behavior should be still an open question, though.
> 
> I fully agree here.

OK, I implemented a couple of patches for this.

The default behavior is a bigger question and I chose disabling the
lockless for less CPU usage.  But it's selectable via configure option
or the runtime asoundrc setup (there is already an option 
dmix.direct_memory_access).


Takashi

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Lock-free dmix considered harmful
  2020-06-19 17:21     ` Takashi Iwai
@ 2020-06-20 19:38       ` Maarten Baert
  0 siblings, 0 replies; 6+ messages in thread
From: Maarten Baert @ 2020-06-20 19:38 UTC (permalink / raw)
  To: Takashi Iwai, Jaroslav Kysela; +Cc: alsa-devel

On 19/06/2020 19:21, Takashi Iwai wrote:
> OK, I implemented a couple of patches for this.
>
> The default behavior is a bigger question and I chose disabling the
> lockless for less CPU usage.  But it's selectable via configure option
> or the runtime asoundrc setup (there is already an option
> dmix.direct_memory_access).
>
>
> Takashi

Thank you, this solution should provide the best of both worlds. I will 
rebase my dmix monitoring patches on this improved version, I think I 
know a way to implement monitoring that does not depend too much on 
whether playback uses the locked or lock-free implementation.

Best regards,
Maarten Baert


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-06-20 19:39 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-15 15:46 Lock-free dmix considered harmful Maarten Baert
2020-05-19 13:46 ` Takashi Iwai
2020-05-19 13:51   ` Jaroslav Kysela
2020-06-19 17:21     ` Takashi Iwai
2020-06-20 19:38       ` Maarten Baert
2020-05-19 16:41   ` Maarten Baert

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.