From: Minchan Kim <minchan@kernel.org>
To: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,
	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Subject: Re: [RFC] zram: support page-based parallel write
Date: Thu, 8 Sep 2016 11:49:40 +0900
Message-ID: <20160908024940.GB21691@bbox>
In-Reply-To: <20160908013444.GA505@swordfish>

Hi Sergey,

On Thu, Sep 08, 2016 at 10:34:44AM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> sorry, I don't have enough time at the moment to review it in details
> due to some urgent issues I'm working on.  can this wait?

Sure, why not.

I need some time to finish the work and remove the RFC tag, too.
I posted the incomplete code just to discuss the direction and the
overall design, not the code details, at this point.

While waiting for your time, I will improve the code quality so you
won't see this mess when you finally have time to review. :)

> 
> 
> I was looking at loop.c awhile ago and was considering to do something
> similar to what they have done; but it never happened.
> 
> 
> I'm a bit 'surprised' that the performance has just 2x, whilst there
> are 4x processing threads. I'd rather expect it to be close to 4x... hm.

Me, too.
It seems to be related to how the work is distributed across CPUs, but
I haven't investigated it in detail yet.

> 
> 
> On (09/06/16 16:24), Minchan Kim wrote:
> [..]
> > +static void worker_wake_up(void)
> > +{
> > +	if (workers.nr_running * NR_BATCH_PAGES < workers.nr_req) {
> > +		int nr_wakeup = (workers.nr_req + NR_BATCH_PAGES) /
> > +				NR_BATCH_PAGES - workers.nr_running;
> > +
> > +		WARN_ON(!nr_wakeup);
> > +		wake_up_nr(&workers.req_wait, nr_wakeup);
> >  	}
> > +}
> 
> this seems to be quite complicated. use a wq perhaps? yes, there is a

Sure, it's soooo messy. I'm reconsidering it now and should end up with
a better way.
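
(For reference, worker_wake_up() above tries to keep roughly one running
worker per NR_BATCH_PAGES pending page requests. As a worked example,
if NR_BATCH_PAGES were 64, an illustrative value not taken from the
patch, then with nr_req = 200 and nr_running = 1 the wake-up condition
1 * 64 < 200 holds, and nr_wakeup = (200 + 64) / 64 - 1 = 3 extra
workers would be woken.)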

About the wq: as you can guess from my function naming, I tried the
workqueue model first for a while. It had several issues, but I can't
describe them exactly because that would need more knowledge of the
workqueue internals than I have time for right now. Anyway, the most
important reason I went with the thread model is that a job's
completion time is totally unbounded: while a worker is handling one
request, another context can keep queueing requests endlessly. For
that case, I think the thread model is more appropriate.
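
To illustrate the direction, the consumer side could look roughly like
the sketch below. It reuses the names from the posted patch
(workers.req_list, workers.req_lock, workers.req_wait, workers.nr_req),
but struct page_request and handle_page_request() are placeholder names
for illustration only, and the nr_running accounting is omitted:

	/* sketch only: page_request and handle_page_request() are placeholders */
	static int zram_thread(void *data)
	{
		struct page_request *req;

		while (!kthread_should_stop()) {
			spin_lock(&workers.req_lock);
			if (list_empty(&workers.req_list)) {
				spin_unlock(&workers.req_lock);
				/* sleep until a bio queues more page requests */
				wait_event_interruptible(workers.req_wait,
					kthread_should_stop() ||
					!list_empty(&workers.req_list));
				continue;
			}
			req = list_first_entry(&workers.req_list,
						struct page_request, list);
			list_del(&req->list);
			workers.nr_req--;
			spin_unlock(&workers.req_lock);

			/* compress or decompress a single page */
			handle_page_request(req);
		}

		return 0;
	}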

> common concern with the wq that all of the workers can stall during OOM,
> but you already have 2 kmalloc()-s in IO path (when adding a new request),
> so low memory condition concerns are out of sight here, I assume.

Yes, it's not a problem, because the logic has a fallback mechanism that
handles the IO request synchronously if the async path fails for some
reason (mostly -ENOMEM).
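
Conceptually, the caller ends up doing something like the snippet below
(just a sketch; __zram_make_sync_request() is a placeholder name for the
existing synchronous path, not the actual function in the patch):

	/*
	 * Try the async path first; if queueing fails (e.g. the GFP_NOIO
	 * kmalloc above returns NULL), fall back to handling the bio
	 * synchronously so no request is ever dropped.
	 */
	if (__zram_make_async_request(zram, bio))
		__zram_make_sync_request(zram, bio);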

> 
> > +static int __zram_make_async_request(struct zram *zram, struct bio *bio)
> > +{
> > +	int offset;
> > +	u32 index;
> > +	struct bio_vec bvec;
> > +	struct bvec_iter iter;
> > +	LIST_HEAD(page_list);
> > +	struct bio_request *bio_req;
> > +	unsigned int nr_pages = 0;
> > +	bool write = op_is_write(bio_op(bio));
> > +
> > +	index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
> > +	offset = (bio->bi_iter.bi_sector &
> > +		  (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
> > +
> > +	bio_req = kmalloc(sizeof(*bio_req), GFP_NOIO);
> > +	if (!bio_req)
> > +		return 1;
> > +
> > +	/*
> > +	 * Keep bi_vcnt to complete bio handling when all of pages
> > +	 * in the bio are handled.
> > +	 */
> > +	bio_req->bio = bio;
> > +	atomic_set(&bio_req->nr_pages, 0);
> > +
> > +	bio_for_each_segment(bvec, bio, iter) {
> > +		int max_transfer_size = PAGE_SIZE - offset;
> > +
> > +		if (bvec.bv_len > max_transfer_size) {
> > +			/*
> > +			 * zram_bvec_rw() can only make operation on a single
> > +			 * zram page. Split the bio vector.
> > +			 */
> > +			struct bio_vec bv;
> > +
> > +			bv.bv_page = bvec.bv_page;
> > +			bv.bv_len = max_transfer_size;
> > +			bv.bv_offset = bvec.bv_offset;
> > +
> > +			if (queue_page_request_list(zram, bio_req, &bv,
> > +				index, offset, write, &page_list))
> > +				goto out;
> > +			nr_pages++;
> > +
> > +			bv.bv_len = bvec.bv_len - max_transfer_size;
> > +			bv.bv_offset += max_transfer_size;
> > +			if (queue_page_request_list(zram, bio_req, &bv,
> > +				index + 1, 0, write, &page_list))
> > +				goto out;
> > +			nr_pages++;
> > +		} else
> > +			if (queue_page_request_list(zram, bio_req, &bvec,
> > +				index, offset, write, &page_list))
> > +				goto out;
> > +			nr_pages++;
> 			^^^^^^^^^^
> +	else {
> 		if (queue_page_request_list(zram, bio_req, &bvec...
> 			.....
> 			nr_pages++;
> + 	}

Thanks, but that's wrong too. :)

The logic should be this way:

        } else {
                if (queue_page_request_list(zram, bio_req, &bvec,
                        index, offset, write, &page_list))
                        goto out;
                nr_pages++;
        }

I will fix it.

> 
> 
> > +		update_position(&index, &offset, &bvec);
> > +	}
> > +
> > +	run_worker(bio, &page_list, nr_pages);
> > +	return 0;
> > +
> > +out:
> > +	kfree(bio_req);
> > +
> > +	WARN_ON(!list_empty(&page_list));
> > +	return 1;
> > +}
> [..]
> > +static int create_workers(void)
> > +{
> > +	int i;
> > +	int nr_cpu = num_online_cpus();
> > +	struct zram_worker *worker;
> > +
> > +	INIT_LIST_HEAD(&workers.worker_list);
> > +	INIT_LIST_HEAD(&workers.req_list);
> > +	spin_lock_init(&workers.req_lock);
> > +	init_waitqueue_head(&workers.req_wait);
> > +
> > +	for (i = 0; i < nr_cpu; i++) {
> > +		worker = kmalloc(sizeof(*worker), GFP_KERNEL);
> > +		if (!worker)
> > +			goto error;
> > +
> > +		worker->task = kthread_run(zram_thread, NULL, "zramd-%d", i);
> 
> there may be several zram devices. may be    "zram-%d/%d, device_id, i"

The zram threads are global. IOW, even if there are a hundred zram
devices, the number of threads equals the number of online CPUs. ;)

> 
> > +		if (IS_ERR(worker->task)) {
> > +			kfree(worker);
> > +			goto error;
> > +		}
> > +
> > +		list_add(&worker->list, &workers.worker_list);
> > +	}
> > +
> > +	return 0;
> > +
> > +error:
> > +	destroy_workers();
> > +	return 1;
> > +}
> 
> 	-ss
