From mboxrd@z Thu Jan 1 00:00:00 1970
From: Coly Li
Subject: Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
Date: Sat, 18 Feb 2017 10:40:35 +0800
Message-ID: <8be07d2a-1611-4552-9af8-946b8002b577@suse.de>
References: <1487176523-109075-1-git-send-email-colyli@suse.de> <87shnevcpr.fsf@notabene.neil.brown.name> <20170217194117.cgbdm247p5p4gj6v@kernel.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Return-path:
In-Reply-To: <20170217194117.cgbdm247p5p4gj6v@kernel.org>
Sender: linux-raid-owner@vger.kernel.org
To: Shaohua Li , NeilBrown
Cc: linux-raid@vger.kernel.org, Shaohua Li , Johannes Thumshirn , Guoqing Jiang
List-Id: linux-raid.ids

On 2017/2/18 3:41 AM, Shaohua Li wrote:
> On Thu, Feb 16, 2017 at 06:04:00PM +1100, NeilBrown wrote:
[snip]
>>> @@ -1447,36 +1501,26 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>>>  
>>>  static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>>>  {
>>> -	struct r1conf *conf = mddev->private;
>>> -	struct r1bio *r1_bio;
>>> +	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
>>> +	struct bio *split;
>>> +	sector_t sectors;
>>>  
>>> -	/*
>>> -	 * make_request() can abort the operation when read-ahead is being
>>> -	 * used and no empty request is available.
>>> -	 *
>>> -	 */
>>> -	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
>>> +	make_request_fn = (bio_data_dir(bio) == READ) ?
>>> +			raid1_read_request : raid1_write_request;
>>>  
>>> -	r1_bio->master_bio = bio;
>>> -	r1_bio->sectors = bio_sectors(bio);
>>> -	r1_bio->state = 0;
>>> -	r1_bio->mddev = mddev;
>>> -	r1_bio->sector = bio->bi_iter.bi_sector;
>>> -
>>> -	/*
>>> -	 * We might need to issue multiple reads to different devices if there
>>> -	 * are bad blocks around, so we keep track of the number of reads in
>>> -	 * bio->bi_phys_segments. If this is 0, there is only one r1_bio and
>>> -	 * no locking will be needed when requests complete. If it is
>>> -	 * non-zero, then it is the number of not-completed requests.
>>> -	 */
>>> -	bio->bi_phys_segments = 0;
>>> -	bio_clear_flag(bio, BIO_SEG_VALID);
>>> +	/* if bio exceeds barrier unit boundary, split it */
>>> +	do {
>>> +		sectors = align_to_barrier_unit_end(
>>> +				bio->bi_iter.bi_sector, bio_sectors(bio));
>>> +		if (sectors < bio_sectors(bio)) {
>>> +			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
>>> +			bio_chain(split, bio);
>>> +		} else {
>>> +			split = bio;
>>> +		}
>>>  
>>> -	if (bio_data_dir(bio) == READ)
>>> -		raid1_read_request(mddev, bio, r1_bio);
>>> -	else
>>> -		raid1_write_request(mddev, bio, r1_bio);
>>> +		make_request_fn(mddev, split);
>>> +	} while (split != bio);
>>>  }
>>
>> I know you are going to change this as Shaohua wants the splitting
>> to happen in a separate function, which I agree with, but there is
>> something else wrong here.
>> Calling bio_split/bio_chain repeatedly in a loop is dangerous.
>> It is OK for simple devices, but when one request can wait for another
>> request to the same device it can deadlock.
>> This can happen with raid1.  If a resync request calls raise_barrier()
>> between one request and the next, then the next has to wait for the
>> resync request, which has to wait for the first request.
>> As the first request will be stuck in the queue in
>> generic_make_request(), you get a deadlock.
>> It is much safer to:
>>
>>   if (need to split) {
>>      split = bio_split(bio, ...)
>>      bio_chain(...)
>>      make_request_fn(mddev, split);
>>      generic_make_request(bio);
>>   } else
>>      make_request_fn(mddev, bio);
>>
>> This way we first process the initial section of the bio (in 'split'),
>> which will queue some requests to the underlying devices.  These
>> requests will be queued in generic_make_request.
>> Then we queue the remainder of the bio, which will be added to the end
>> of the generic_make_request queue.
>> Then we return.
>> generic_make_request() will pop the lower-level device requests off the
>> queue and handle them first.  Then it will process the remainder
>> of the original bio once the first section has been fully processed.
>
> Good point! raid10 has the same problem.  It looks like this doesn't
> solve the issue for a device stacked three levels deep, though.
>
> I know you guys are working on this issue in the block layer.  Should
> we fix the issue on the MD side (for two-level stacked devices) or wait
> for the block layer patch?

Obviously I don't get the point at all ... Could you please explain a
little more about why it is an issue and how it may happen?

Thanks a lot :-)

Coly
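
Below is a minimal sketch of the non-recursive splitting Neil describes
above, reusing the helpers visible in this patch
(align_to_barrier_unit_end(), raid1_read_request(), raid1_write_request())
and the 4.10-era block helpers (bio_split(), bio_chain(),
generic_make_request()).  It only illustrates the suggested pattern; it is
not the code that was eventually merged:

static void raid1_make_request(struct mddev *mddev, struct bio *bio)
{
	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
	sector_t sectors;

	make_request_fn = (bio_data_dir(bio) == READ) ?
			  raid1_read_request : raid1_write_request;

	/* Trim the request so it does not cross a barrier unit boundary. */
	sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector,
					    bio_sectors(bio));

	if (sectors < bio_sectors(bio)) {
		struct bio *split;

		/* Split exactly once and handle the head now ... */
		split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
		bio_chain(split, bio);
		make_request_fn(mddev, split);
		/*
		 * ... and queue the tail on generic_make_request()'s
		 * current->bio_list instead of looping here.  The
		 * lower-level bios generated for 'split' are popped and
		 * issued first, so a raise_barrier() that sneaks in
		 * between the two parts cannot leave the tail waiting on
		 * requests that are still stuck in the queue.
		 */
		generic_make_request(bio);
	} else {
		make_request_fn(mddev, bio);
	}
}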