On Wed, 2016-02-03 at 13:07 +0200, Oded Gabbay wrote: > > Another, perhaps trivial, question. > > When there is an address fault, who handles it ? the SVM driver, or > > each device driver ? > > > > In other words, is the model the same as (AMD) IOMMU where it binds > > amd_iommu driver to the IOMMU H/W, and that driver (amd_iommu/v2) is > > the only one which handles the PPR events ? > > > > If that is the case, then with SVM, how will the device driver be made > > aware of faults, if the SVM driver won't notify him about them, > > because it has already severed the connection between PASID and > > process ? In the ideal case, there's no need for the device driver to get involved at all. When a page isn't found in the page tables, the IOMMU code calls handle_mm_fault() and either populates the page and sends a a 'success' response, or sends an 'invalid fault' response back. To account for broken hardware, we *have* added a callback into the device driver when these faults happen. Ideally it should never be used, of course. In the case where the process has gone away, the PASID is still assigned and we still hold mm_count on the MM, just not mm_users. This callback into the device driver still occurs if a fault happens during process exit between the exit_mm() and exit_files() stage. > And another question, if I may, aren't you afraid of "false positive" > prints to dmesg ? I mean, I'm pretty sure page faults / pasid faults > errors will be logged somewhere, probably to dmesg. Aren't you > concerned of the users seeing those errors and thinking they may have > a bug, while actually the errors were only caused by process > termination ? If that's the case, it's easy enough to silence them. We are already explicitly testing for the 'defunct mm' case in our fault handler, to prevent us from faulting more pages into an obsolescent MM after its mm_users reaches zero and its page tables are supposed to have been torn down. That's the 'if(!atomic_inc_not_zere(&svm->mm->mm_users)) goto bad_req;' part. > Or in that case you say that the application is broken, because if it > still had something running in the H/W, it should not have closed > itself ? That's also true but it's still nice to avoid confusion. Even if only to disambiguate cause and effect — we don't want people to see PASID faults which were caused by the process crashing, and to think that they might be involved in *causing* that process to crash... -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation