Re: [RFC 4/5] r5cache: write part of r5cache

From: Song Liu <songliubraving@fb.com>
To: NeilBrown <nfbrown@novell.com>,
	"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Cc: Shaohua Li <shli@fb.com>,
	"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
	"hch@infradead.org" <hch@infradead.org>,
	Kernel Team <Kernel-team@fb.com>
Subject: Re: [RFC 4/5] r5cache: write part of r5cache
Date: Wed, 1 Jun 2016 13:36:02 +0000	[thread overview]
Message-ID: <8FB9AC1E-AB6D-41B2-9411-C924B23FA7AF@fb.com> (raw)
In-Reply-To: <87zir5r496.fsf@notabene.neil.brown.name>

>> path runs in state CLEAN and RUNNING (data in cache). Cache writes
>> start from r5c_handle_stripe_dirtying(), where bit R5_Wantcache is
>> set for devices with bio in towrite. Then, the data is written to
>> the journal through r5l_log implementation. Once the data is in the
>> journal, we set bit R5_InCache, and presue bio_endio for these
>> writes.
>
>I don't think this new state field is a good idea.  We already have
>plenty of state information and it would be best to fit in with that.
>When handle_stripe() runs it assesses the state of the stripe and
>decides what to do next.  The sh->count is incremented and it doesn't
>return to zero until all the actions that were found to be necessary
>have completely.  When it does return to zero, handle_stripe() runs
>again to see what else is needed.

Let me try to avoid the new state field. 

>
>You may well need to add sh->state or dev->flags flags to record new
>aspects of the state, such as whether data is safe in the log or safe in
>the RAID, but I don't think a new state variable will really help.
>
>>
>> The reclaim path starts by freezing the stripe (no more writes).
>> This stripes back to raid5 state machine, where
>> handle_stripe_dirtying will evaluate the stripe for reconstruct
>> writes or RMW writes (read data and calculate parity).
>>
>> For RMW, the code allocates extra page for prexor. Specifically,
>> a new page is allocated for r5dev->page to do prexor; while
>> r5dev->orig_page keeps the cached data. The extra page is freed
>> after prexor.
>
>It isn't at all clear to me why you need the extra page.  Could you
>explain when and why the ->page and the ->orig_page would contain
>different content and both of the content would be needed?
>

In prexor, we need read old data from the disk to perform the 
substract-xor. In the meanwhile, we need a buffer for the new data. 
When there is no write cache, the new data is in the page of bio. 
With write cache, however, the bio might be already released. 
Therefore, we will need an extra page here: orig_page for new 
Cached data, while newly allocated page just hold old data for prexor. 

>>
>> r5cache naturally excludes SkipCopy. With R5_Wantcache bit set,
>> async_copy_data will not skip copy.
>>
>> Before writing data to RAID disks, the r5l_log logic stores
>> parity (and non-overwrite data) to the journal.
>>
>> Instead of inactive_list, stripes with cached data are tracked in
>> r5conf->r5c_cached_list. r5conf->r5c_cached_stripes tracks how
>> many stripes has dirty data in the cache.
>>
>> Two sysfs entries are provided for the write cache:
>> 1. r5c_cached_stripes shows how many stripes have cached data.
>>    Writing anything to r5c_cached_stripes will flush all data
>>    to RAID disks.
>> 2. r5c_cache_mode provides knob to switch the system between
>>    write-back or write-through (only write-log).
>>
>> There are some known limitations of the cache implementation:
>>
>> 1. Write cache only covers full page writes (R5_OVERWRITE). Writes
>>    of smaller granularity are write through.
>> 2. Only one log io (sh->log_io) for each stripe at anytime. Later
>>    writes for the same stripe have to wait. This can be improved by
>>    moving log_io to r5dev.
>>
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> Signed-off-by: Shaohua Li <shli@fb.com>
>> ---
>>  drivers/md/raid5-cache.c | 399 +++++++++++++++++++++++++++++++++++++++++++++--
>>  drivers/md/raid5.c       | 172 +++++++++++++++++---
>>  drivers/md/raid5.h       |  38 ++++-
>>  3 files changed, 577 insertions(+), 32 deletions(-)
>>
>> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>> index 5f0d96f..66a3cd5 100644
>> --- a/drivers/md/raid5-cache.c
>> +++ b/drivers/md/raid5-cache.c
>> @@ -40,7 +40,19 @@
>>   */
>>  #define R5L_POOL_SIZE	4
>>  
>> +enum r5c_cache_mode {
>> +	R5C_MODE_NO_CACHE = 0,
>> +	R5C_MODE_WRITE_THROUGH = 1,
>> +	R5C_MODE_WRITE_BACK = 2,
>> +};
>> +
>> +static char *r5c_cache_mode_str[] = {"no-cache", "write-through", "write-back"};
>> +
>>  struct r5c_cache {
>> +	int flush_threshold;		/* flush the stripe when flush_threshold buffers are dirty  */
>> +	int mode;			/* enum r5c_cache_mode */
>
> why isn't this just "enum r5c_cache_mode mode;" ??
>
>Clearly this structure is more than just statistics.  But now I wonder
>why these fields aren't simply added to struct r5conf.  Or maybe in
>r5l_log.

I will try whether it is better to add these to r5conf. 

Thanks,
Songd
>
>
>NeilBrown