Re: BUG: aio/direct-io data corruption in 4.7

From: Gregory Shapiro <shapiro.gregory@gmail.com>
To: jack.wang.usish@gmail.com
Cc: hch@infradead.org, jnicklin@blockbridge.com,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	gregory.shapiro@kaminario.com
Subject: Re: BUG: aio/direct-io data corruption in 4.7
Date: Tue, 6 Nov 2018 13:31:44 +0200	[thread overview]
Message-ID: <CAAGb-1eLHwRy2L5tp+skWdV5zJ7jfCwTe-1UnJq_n5yjQfdJug@mail.gmail.com> (raw)
In-Reply-To: <CA+res+SDaTwjkKQpnREP1dW8XmtQsk7NGLj=heYGbGeL3Dr-OQ@mail.gmail.com>

Hi Jack,
I tested it in 4.9.102 and I checked the latest code from elixir
(versions 4.19 and 4.20) and the error in code is still present there.
More on the scenario and the bug:
I experienced data corruption in my application (nvme based storage).
The issue was caused because of faulty hardware, but the real problem
is I got a correct number of bytes in io_getevents thus couldn't
recognize it correctly the error.
Looking at the /var/log/messages and  I saw the following errors in
time of coruption:

Oct 11 14:55:15 block01-node05 kernel: [19272.951015]
blk_update_request: I/O error, dev nvme2n3, sector 117359360
Oct 11 14:55:15 block01-node05 kernel: [19272.952786]
blk_update_request: I/O error, dev nvme2n3, sector 117359872
Oct 11 14:55:16 block01-node05 kernel: [19273.544374]
blk_update_request: I/O error, dev nvme2n3, sector 117360384
...
So the block level does receive information about the error, but I
don't see it in the application.
running ftrace and doing code reading I find out that dio error status
is overridden.
In dio_complete it is propagated in (dio->io_error and if
dio->io_error is not zero in we are in async write
the status is overridden by transferred.

static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
{
...
        if (ret == 0)
                ret = dio->page_errors;
        if (ret == 0)
                ret = dio->io_error;
        if (ret == 0)
                ret = transferred;
...
        if (is_async) {
                /*
                 * generic_write_sync expects ki_pos to have been updated
                 * already, but the submission path only does this for
                 * synchronous I/O.
                 */
                dio->iocb->ki_pos += transferred;

                if (dio->op == REQ_OP_WRITE)
                        ret = generic_write_sync(dio->iocb,  transferred);
                dio->iocb->ki_complete(dio->iocb, ret, 0);

For your convenience I am attaching ftrace log to for easier tracking
the flow in the code:

 26)               |                nvme_complete_rq [nvme_core]() {
 26)               |                  blk_mq_end_request() {
 26)               |                    blk_update_request() { <----
log is from here
 26)   0.563 us    |                      blk_account_io_completion();
 26)   0.263 us    |                      bio_advance();
 26)               |                      bio_endio() {
 26)               |                        dio_bio_end_aio() {
 26)               |                          dio_bio_complete() {
 26)               |                            bio_check_pages_dirty() {
 26)               |                              bio_put() {
 26)               |                                bio_free() {
 26)               |                                  __bio_free() {
 26)   0.045 us    |                                    bio_disassociate_task();
 26)   0.497 us    |                                  }
 26)   0.042 us    |                                  bvec_free();
 26)               |                                  mempool_free() {
 26)               |                                    mempool_free_slab() {
 26)   0.264 us    |                                      kmem_cache_free();
 26)   0.606 us    |                                    }
 26)   1.125 us    |                                  }
 26)   2.588 us    |                                }
 26)   2.920 us    |                              }
 26)   3.979 us    |                            }
 26)   4.712 us    |                          }
 26)   0.040 us    |                          _raw_spin_lock_irqsave();
 26)   0.048 us    |                          _raw_spin_unlock_irqrestore();
 26)               |                          dio_complete() {
dio_complete(dio, 0, true);
 26)               |                            aio_complete() {
dio->iocb->ki_complete(dio->iocb, ret, 0); <<aio_complete(struct kiocb
*kiocb, long res, long res2)>>
 26)   0.073 us    |                              _raw_spin_lock_irqsave();
 26)   0.114 us    |                              refill_reqs_available();
 26)   0.048 us    |                              _raw_spin_unlock_irqrestore();
 26)               |                              kiocb_free() {
 26)   0.171 us    |                                fput();
 26)   0.102 us    |                                kmem_cache_free();
 26)   0.902 us    |                              }

        }

On Tue, Nov 6, 2018 at 9:29 AM Jack Wang <jack.wang.usish@gmail.com> wrote:
>
> Gregory Shapiro <shapiro.gregory@gmail.com> 于2018年11月5日周一 下午4:19写道：
> >
> > Hello, my name is Gregory Shapiro and I am a newbie on this list.
> > I recently encountered data corruption as I got a kernel to
> > acknowledge write ("io_getevents" system call with a correct number of
> > bytes) but undergoing write to disk failed.
> > After investigating the problem I found it is identical to issue found
> > in direct-io.c mentioned the bellow thread.
> > https://lore.kernel.org/lkml/20160921141539.GA17898@infradead.org/
> > Is there a reason proposed patch didn't apply to the kernel?
> > When can I expect it to be applied?
> > Thanks,
> >  Gregory
>
> Hi Gregory,
>
> Thanks for your info.
> Have you tried with latest kernel other than 4.7, is the problem still there?
>
> Could you share your test case?
>
> Regards,
> Jack Wang