From mboxrd@z Thu Jan  1 00:00:00 1970
From: will.deacon@arm.com (Will Deacon)
Date: Fri, 9 May 2014 18:55:05 +0100
Subject: [PATCH v3] arm64: enable EDAC on arm64
In-Reply-To: <20140509173340.GO7950@arm.com>
References: <1398096556-26799-1-git-send-email-robherring2@gmail.com>
 <20140422102455.GD7484@arm.com>
 <20140422132624.GC9820@arm.com>
 <20140422160100.GH9820@arm.com>
 <20140423170445.GI5649@arm.com>
 <20140509173340.GO7950@arm.com>
Message-ID: <20140509175505.GC23083@arm.com>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Fri, May 09, 2014 at 06:33:40PM +0100, Catalin Marinas wrote:
> On Wed, Apr 23, 2014 at 06:04:45PM +0100, Will Deacon wrote:
> > On Tue, Apr 22, 2014 at 05:29:52PM +0100, Rob Herring wrote:
> > > On Tue, Apr 22, 2014 at 11:01 AM, Will Deacon wrote:
> > > > Looking at the edac_mc_scrub_block code, atomic_scrub is always called with
> > > > a normal, cacheable mapping (kmap_atomic) so that doesn't help us (although
> > > > it means the exclusives will at least succeed).
> > > >
> > > > The problem of speculative reads by the CPU could be solved by unmapping the
> > > > DMA buffer when we transfer the ownership over to the device (instead of
> > > > invalidating it after the transfer). However, I'm now slightly confused as
> > > > to how atomic_scrub fixes errors reported at any cache level higher than
> > > > L1. Do we need cache-flushing to ensure that the exclusive-store propagates
> > > > to the point of failure?
> > >
> > > The whole point of scrubbing is to stop repeated error reporting of
> > > correctable errors. For example, you do a write to memory and the ECC
> > > code is added to it. Suppose the data stored in the memory gets
> > > corrupted either on the write or some time later you get a bit flip in
> > > the memory cell. Then when the data is read from memory, the memory
> > > controller will detect the error, correct it, and trigger an ECC
> > > correctable error interrupt. It will do this every time you read that
> > > memory location because the error occurred on the write. The only way
> > > to clear the error is re-writing memory.
> >
> > Thanks for the explanation.
> >
> > > As long as that cache line is dirty, no reads from that memory location
> > > will occur as other readers will get the line from other cores, the L2, or
> > > the line will get pushed out to memory first.
> >
> > Agreed, if all of the readers are coherent.
>
> Just to get things moving on this patch, do we agree that it is only safe
> for coherent DMA? If so, do we merge it on the grounds that people
> needing EDAC only use it with DMA-coherent memory? The comment for
> atomic_scrub should be updated to state coherent DMA only.
>
> We could check the dma_ops in atomic_scrub but I don't think it's worth
> it.

Well, I think we should do *something* for the non-coherent case, otherwise
we're going to have fun debugging random buffer corruption. Could we disable
scrubbing from the dma_bus_notifier the moment we find a non-coherent
device?

Will
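
Editorial illustration, not part of the original message: the suggestion
above (turn scrubbing off as soon as a non-coherent device appears) can be
modelled roughly as follows. This is a user-space C sketch, not kernel code.
Every name in it (model_device, model_device_added, model_scrub,
scrub_enabled) is hypothetical, and the plain read-then-write loop only
stands in for the exclusive load/store sequence that atomic_scrub is
described as using in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a device as seen by a bus notifier. */
struct model_device {
	const char *name;
	bool dma_coherent;	/* assumption: coherency is known at add time */
};

/* Global switch: scrubbing stays on until a non-coherent device appears. */
static bool scrub_enabled = true;

/*
 * Stand-in for a notifier callback invoked when a device is added.
 * The switch is sticky: once cleared it is never set again, because a
 * non-coherent device may DMA into any buffer that gets scrubbed later.
 */
static void model_device_added(const struct model_device *dev)
{
	if (!dev->dma_coherent)
		scrub_enabled = false;
}

/*
 * Stand-in for scrubbing: rewrite each word with the value just read so
 * that the ECC code is regenerated. Returns the number of words
 * rewritten, which is 0 once scrubbing has been disabled.
 */
static size_t model_scrub(volatile uint32_t *buf, size_t words)
{
	size_t i;

	if (!scrub_enabled)
		return 0;

	for (i = 0; i < words; i++)
		buf[i] = buf[i];	/* plain rewrite, not an exclusive pair */

	return words;
}
```

In this model, registering only coherent devices leaves model_scrub
rewriting the buffer, while the first device with dma_coherent == false
makes every later call a no-op. Whether a global, sticky switch is the right
granularity for the real code is left open by the thread.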