Re: [PATCH RFC 2/2] futex: Implement mechanism to wait on any of several futexes - Pierre-Loup A. Griffais

From: "Pierre-Loup A. Griffais" <pgriffais@valvesoftware.com>
To: Zebediah Figura <z.figura12@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: <mingo@redhat.com>, <peterz@infradead.org>,
	<dvhart@infradead.org>, <linux-kernel@vger.kernel.org>,
	<kernel@collabora.com>, Steven Noonan <steven@valvesoftware.com>
Subject: Re: [PATCH RFC 2/2] futex: Implement mechanism to wait on any of several futexes
Date: Wed, 31 Jul 2019 18:42:38 -0700
Message-ID: <3af1586a-f5b8-9728-d140-4fc4709ba49c@valvesoftware.com>
In-Reply-To: <a7b54799-2fda-2e7b-821a-1ec9652e9596@gmail.com>

On 7/31/19 6:32 PM, Zebediah Figura wrote:
> On 7/31/19 8:22 PM, Zebediah Figura wrote:
>> On 7/31/19 7:45 PM, Thomas Gleixner wrote:
>>> If I assume a maximum of 65 futexes, which got mentioned in one of the
>>> replies, then this will allocate 7280 bytes for the futex_q array alone
>>> with a stock Debian config, which has no debug options enabled that
>>> would bloat the struct. Adding the futex_wait_block array into the same
>>> allocation makes it larger than 8K, which already exceeds the limit of
>>> SLUB kmem caches and forces the whole thing into the page allocator
>>> directly.
>>>
>>> This sucks.
>>>
>>> Also I'm confused about the "64 maximum resulting in 65 futexes" comment
>>> in one of the mails.
>>>
>>> Can you please explain what exactly you are trying to do on the user
>>> space side?
>>
>> The extra futex comes from the fact that there are a couple of, as it
>> were, out-of-band ways to wake up a thread on Windows. [Specifically, a
>> thread can enter an "alertable" wait in which case it will be woken up
>> by a request from another thread to execute an "asynchronous procedure
>> call".] It's easiest for us to just add another futex to the list in
>> that case.
> 
> To be clear, the 64/65 distinction is an implementation detail that's 
> pretty much outside the scope of this discussion. I should have just 
> said 65 directly. Sorry about that.
> 
>>
>> I'd also point out, for whatever it's worth, that while 64 is a hard
>> limit, real applications almost never go nearly that high. By far the
>> most common number of primitives to select on is one.
>> Performance-critical code never tends to wait on more than three. The
>> most I've ever seen is twelve.
>>
>> If you'd like to see the user-side source, most of the relevant code is
>> at [1], in particular the functions __fsync_wait_objects() [line 712]
>> and do_single_wait() [line 655]. Please feel free to ask for further
>> clarification.
>>
>> [1]
>> https://github.com/ValveSoftware/wine/blob/proton_4.11/dlls/ntdll/fsync.c
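
To make the 64/65 distinction quoted above concrete, here is a minimal
sketch of how a wait list could be assembled: up to 64 object futexes,
plus one extra futex when the wait is alertable. The helper name
build_wait_list(), its parameters, and the two macros are hypothetical
and chosen only for illustration; this is not the code in fsync.c.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_WAIT_OBJECTS 64                      /* hard limit mentioned above */
#define MAX_WAIT_FUTEXES (MAX_WAIT_OBJECTS + 1)  /* plus one futex for alertable waits */

/*
 * Hypothetical helper: copy the futex words to wait on into 'out' and,
 * for an alertable wait, append one extra futex that another thread can
 * wake to deliver an asynchronous procedure call.
 * Returns the number of entries written, or -1 on invalid input.
 */
static int build_wait_list(uint32_t **objects, size_t count,
                           uint32_t *apc_futex, int alertable,
                           uint32_t *out[MAX_WAIT_FUTEXES])
{
    size_t n = 0;

    if (count > MAX_WAIT_OBJECTS)
        return -1;

    for (size_t i = 0; i < count; i++)
        out[n++] = objects[i];

    if (alertable) {
        if (!apc_futex)
            return -1;
        out[n++] = apc_futex;   /* the 65th entry in the worst case */
    }

    return (int)n;
}
```

With 64 objects and an alertable wait, the list holds 65 entries, which
is where the larger number in this thread comes from.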

In addition, here's an example of how I think it might be useful to 
expose it to apps at large in a way that's compatible with existing 
pthread mutexes:

https://github.com/Plagman/glibc/commit/3b01145fa25987f2f93e7eda7f3e7d0f2f77b290

This patch hasn't received nearly as much testing as the Wine fsync code 
path, but that functionality would provide more CPU-efficient ways for 
thread pool code to sleep in our game engine. We also use eventfd today.

For this, I think the expected upper bound for the per-op futex count 
would be of the same order of magnitude as the logical CPU count on the 
target machine, similar to the Wine use case.
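
As a rough back-of-the-envelope check on the sizes discussed earlier in
the thread, the sketch below estimates the combined allocation for a
given futex count. The 112 bytes per futex_q entry is derived from the
7280 bytes for 65 entries quoted above. The 16 bytes per
futex_wait_block entry is an assumption (one user pointer plus two
32-bit fields on a 64-bit build), and the function and macro names are
made up for illustration.

```c
#include <stddef.h>

/* Derived from the figure quoted in the thread: 7280 bytes / 65 entries. */
#define FUTEX_Q_BYTES     (7280 / 65)   /* 112 bytes per futex_q */
/* Assumption: pointer + two 32-bit fields on a 64-bit build. */
#define WAIT_BLOCK_BYTES  16
/* The 8K SLUB kmem cache limit mentioned in the thread. */
#define SLUB_LIMIT_BYTES  8192

/* Estimated bytes needed if both arrays share one allocation. */
static size_t wait_alloc_bytes(unsigned int nr_futexes)
{
    return (size_t)nr_futexes * (FUTEX_Q_BYTES + WAIT_BLOCK_BYTES);
}

/* Nonzero if the estimate still fits under the 8K limit. */
static int fits_under_slub_limit(unsigned int nr_futexes)
{
    return wait_alloc_bytes(nr_futexes) <= SLUB_LIMIT_BYTES;
}
```

Under these assumptions, 65 futexes come to 8320 bytes, just over the
limit, while a count around a typical logical CPU count (for example 16,
at 2048 bytes) stays well below it.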

Thanks,
  - Pierre-Loup

>>
>>
>>
>>>
>>> Thanks,
>>>
>>>     tglx
>>>
>>
>