On Tue, Nov 22 2016, Shaohua Li wrote: > On Tue, Nov 22, 2016 at 11:25:04AM +1100, Neil Brown wrote: >> On Tue, Nov 22 2016, Shaohua Li wrote: >> >> > On Mon, Nov 21, 2016 at 12:19:43PM +1100, Neil Brown wrote: >> >> There are 2 problems with using bi_phys_segments as a counter >> >> 1/ we only use 16bits, which limits bios to 256M >> >> 2/ it is poor form to reuse a field like this. It interferes >> >> with other changes to bios. >> >> >> >> We need to clean up a few things before we can change the use the >> >> counter which is now available inside a bio. >> >> >> >> I have only tested this lightly. More review and testing would be >> >> appreciated. >> > >> > So without the accounting, we: >> > - don't do bio completion trace >> >> Yes, but hopefully that will be added back to bio_endio() soon. >> >> > - call md_write_start/md_write_end excessively, which involves atomic operation. >> >> raid5_inc_bio_active_stripes() did an atomic operation. I don't think >> there is a net increase in the number of atomic operations. > > That's different. md_write_start/end uses a global atomic. > raid5_inc_bio_active_stripes uses a bio atomic. So we have more cache bouncing now. Maybe. Most md_write_start() calls are made in the context of raid5_make_request(). We could - call md_write_start() once at the start - count how many times we want to call it in a variable local to raid5_make_request() - atomically add that to the counter at the end. Similarly mode md_write_end() requests are in the context of raid5d. It could maintain local counter and apply them all in a single update before it sleeps. It would be a little messy, but not too horrible I think. > >> > >> > Not big problems. But we are actually reusing __bi_remaining, I'm wondering why >> > we not explicitly reuse it. Eg, adds bio_dec_remaining_return() and uses it >> > like raid5_dec_bi_active_stripes. >> >> Because using it exactly the same way that other places use it leads to >> fewer surprises, now or later. >> And I think that the effort to rearrange the code so that we could just >> call bio_endio() brought real improvements in code clarity and >> simplicity. > > Not the same way. The return_bi list and retry list fix are still good. We can > replace the bio_endio in your patch with something like: > if (bio_dec_remaining_return() == 0) { > trace_block_bio_complete() > md_write_end() > bio_endio(); > } > This will give us better control when to end io. This isn't safe. The bio arriving at raid5_make_request() might already have been split and could be chained. Then raid5 might never see bio_dec_remaining_return() return zero. For example, suppose there is a RAID0 make of some other device, and this RAID5. A write request arrives which crosses a chunk boundary. raid0.c calls bio_split to split off a new bio that will fit in the other device, leaving the original bio with a larger bi_sector which will get mapped only into the raid5. The split bio is chained into the original bio, elevating its __bi_remaining count. If the other device is particularly slow, or the RAID5 is particularly fast, the RAID5 IO might complete before the split bio completes, so raid5 will only see __bi_remaining go down to one, not zero. When the split bio finally completes, it's bi_endio is bio_chain_endio(), and that will call the final bio_endio() on the original bio. md_write_end() would then never be called. NeilBrown