Re: Buffered IO async context overhead

From: Andres Freund <andres@anarazel.de>
To: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
Subject: Re: Buffered IO async context overhead
Date: Mon, 24 Feb 2020 01:35:44 -0800	[thread overview]
Message-ID: <20200224093544.kg4kmuerevg7zooq@alap3.anarazel.de> (raw)
In-Reply-To: <4896063a-20d7-d2dd-c75e-a082edd5d72f@kernel.dk>

Hi,

On 2020-02-14 13:49:31 -0700, Jens Axboe wrote:
> [description of buffered write workloads being slower via io_uring
> than plain writes]
> Because I'm working on other items, I didn't read carefully enough. Yes
> this won't change the situation for writes. I'll take a look at this when
> I get time, maybe there's something we can do to improve the situation.

I looked a bit into this.

I think one issue is the spinning the workers do:

static int io_wqe_worker(void *data)
{

	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
		set_current_state(TASK_INTERRUPTIBLE);
loop:
		if (did_work)
			io_worker_spin_for_work(wqe);
		spin_lock_irq(&wqe->lock);
		if (io_wqe_run_queue(wqe)) {

static inline void io_worker_spin_for_work(struct io_wqe *wqe)
{
	int i = 0;

	while (++i < 1000) {
		if (io_wqe_run_queue(wqe))
			break;
		if (need_resched())
			break;
		cpu_relax();
	}
}

even with the cpu_relax(), that causes quite a lot of cross socket
traffic, slowing down the submission side. Which after all frequently
needs to take the wqe->lock, just to be able to submit a queue
entry.

lock, work_list, flags all reside in one cacheline, so it's pretty
likely that a single io_wqe_enqueue would get the cacheline "stolen"
several times during one enqueue - without allowing any progress in the
worker, of course.

I also wonder if we can't avoid dequeuing entries one-by-one within the
worker, at least for the IO_WQ_WORK_HASHED case. Especially when writes
are just hitting the page cache, they're pretty fast, making it
plausible to cause pretty bad contention on the spinlock (even without
the spining above). Whereas the submission side is at least somewhat
likely to be able to submit several queue entries while the worker is
processing one job, that's pretty unlikely for workers.

In the hashed case there shouldn't be another worker processing entries
for the same hash. So it seems quite possible for the wqe to drain a few
of the entries for that hash within one spinlock acquisition, and then
process them one-by-one?

Greetings,

Andres Freund