android_kernel_msm-6.1_noth.../include
David Hildenbrand 088b8aa537 mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast
commit 6c287605fd ("mm: remember exclusively mapped anonymous pages with
PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
cleared during temporary unmapping of a page, that the PTE is
cleared/invalidated and that the TLB is flushed.

What we want to achieve in all cases is that we cannot end up with a pin on
an anonymous page that may be shared, because such pins would be
unreliable and could result in memory corruptions when the mapped page
and the pin go out of sync due to a write fault.

That TLB flush handling was inspired by an outdated comment in
mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
the past to synchronize with GUP-fast. However, ever since general RCU GUP
fast was introduced in commit 2667f50e8b ("mm: introduce a general RCU
get_user_pages_fast()"), a TLB flush is no longer sufficient to handle
concurrent GUP-fast in all cases -- it only handles traditional IPI-based
GUP-fast correctly.

Peter Xu (thankfully) questioned whether that TLB flush is really
required. On architectures that send an IPI broadcast on TLB flush,
it works as expected. To synchronize with RCU GUP-fast properly, we're
conceptually fine, however, we have to enforce a certain memory order and
are missing memory barriers.

Let's document that, avoid the TLB flush where possible and use proper
explicit memory barriers where required. We shouldn't really care about the
additional memory barriers here, as we're not on extremely hot paths --
and we're getting rid of some TLB flushes.

We use a smp_mb() pair for handling concurrent pinning and a
smp_rmb()/smp_wmb() pair for handling the corner case of only temporary
PTE changes but permanent PageAnonExclusive changes.

One extreme example, whereby GUP-fast takes a R/O pin and KSM wants to
convert an exclusive anonymous page to a KSM page, and that page is already
mapped write-protected (-> no PTE change) would be:

	Thread 0 (KSM)			Thread 1 (GUP-fast)

					(B1) Read the PTE
					# (B2) skipped without FOLL_WRITE
	(A1) Clear PTE
	smp_mb()
	(A2) Check pinned
					(B3) Pin the mapped page
					smp_mb()
	(A3) Clear PageAnonExclusive
	smp_wmb()
	(A4) Restore PTE
					(B4) Check if the PTE changed
					smp_rmb()
					(B5) Check PageAnonExclusive

Thread 1 will properly detect that PageAnonExclusive was cleared and
back off.

Note that we don't need a memory barrier between checking if the page is
pinned and clearing PageAnonExclusive, because stores are not
speculated.

The possible issues due to reordering are of theoretical nature so far
and attempts to reproduce the race failed.

Especially the "no PTE change" case isn't the common case, because we'd
need an exclusive anonymous page that's mapped R/O and the PTE is clean
in KSM code -- and using KSM with page pinning isn't extremely common.
Further, the clear+TLB flush we used for now implies a memory barrier.
So the problematic missing part should be the missing memory barrier
after pinning but before checking if the PTE changed.

Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com
Fixes: 6c287605fd ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Christoph von Recklinghausen <crecklin@redhat.com>
Cc: Don Dutile <ddutile@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 20:26:11 -07:00
..
acpi Merge branch 'acpi-properties' 2022-08-11 19:21:03 +02:00
asm-generic Seventeen hotfixes. Mostly memory management things. Ten patches are 2022-08-28 14:49:59 -07:00
clocksource - Add the missing DT bindings for the MTU nomadik timer (Linus 2022-07-28 12:33:34 +02:00
crypto for-5.20/block-2022-08-04 2022-08-04 20:00:14 -07:00
drm Driver uAPI changes: 2022-07-22 15:51:31 +10:00
dt-bindings power supply and reset changes for the v6.0 series 2022-08-12 09:37:33 -07:00
keys
kunit
kvm KVM: arm64: vgic: Consolidate userspace access for base address setting 2022-07-17 11:55:33 +01:00
linux mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast 2022-09-11 20:26:11 -07:00
math-emu
media SPDX changes for 6.0-rc1 2022-08-04 12:12:54 -07:00
memory
misc
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf 2022-08-24 19:18:10 -07:00
pcmcia
ras mm, hwpoison: enable memory error handling on 1GB hugepage 2022-08-08 18:06:44 -07:00
rdma dma-mapping updates 2022-08-06 10:56:45 -07:00
rv Documentation/rv: Add deterministic automata monitor synthesis documentation 2022-07-30 14:01:29 -04:00
scsi SCSI misc on 20220813 2022-08-13 13:41:48 -07:00
soc net: mscc: ocelot: keep ocelot_stat_layout by reg address, not offset 2022-08-17 21:58:32 -07:00
sound ASoC: More updates for v5.20 2022-08-01 15:26:40 +02:00
target scsi: target: core: De-RCU of se_lun and se_lun acl 2022-08-01 19:36:02 -04:00
trace mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage 2022-09-11 20:25:46 -07:00
uapi userfaultfd: add /dev/userfaultfd for fine grained access control 2022-09-11 20:25:48 -07:00
ufs scsi: ufs: core: Enable link lost interrupt 2022-08-11 22:04:32 -04:00
vdso
video
xen x86/xen: Add support for HVMOP_set_evtchn_upcall_vector 2022-08-12 11:28:21 +02:00