Josh Boyer 440886d
Path: news.gmane.org!not-for-mail
Josh Boyer 440886d
From: Andrea Arcangeli <aarcange@redhat.com>
Josh Boyer 440886d
Newsgroups: gmane.linux.kernel.mm
Josh Boyer 440886d
Subject: [PATCH] mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition
Josh Boyer 440886d
Date: Thu, 24 May 2012 01:39:01 +0200
Josh Boyer 440886d
Lines: 208
Josh Boyer 440886d
Approved: news@gmane.org
Josh Boyer 440886d
Message-ID: <1337816341-30743-1-git-send-email-aarcange@redhat.com>
Josh Boyer 440886d
References: <20120518230028.GF32479@redhat.com>
Josh Boyer 440886d
NNTP-Posting-Host: plane.gmane.org
Josh Boyer 440886d
X-Trace: dough.gmane.org 1337816354 18906 80.91.229.3 (23 May 2012 23:39:14 GMT)
Josh Boyer 440886d
X-Complaints-To: usenet@dough.gmane.org
Josh Boyer 440886d
NNTP-Posting-Date: Wed, 23 May 2012 23:39:14 +0000 (UTC)
Josh Boyer 440886d
Cc: Andrew Morton <akpm@linux-foundation.org>, Mel Gorman <mgorman@suse.de>,
Josh Boyer 440886d
        Hugh Dickins <hughd@google.com>, Larry Woodman <lwoodman@redhat.com>,
Josh Boyer 440886d
        Petr Matousek <pmatouse@redhat.com>,
Josh Boyer 440886d
        Ulrich Obergfell <uobergfe@redhat.com>, Rik van Riel <riel@redhat.com>
Josh Boyer 440886d
To: linux-mm@kvack.org
Josh Boyer 440886d
Original-X-From: owner-linux-mm@kvack.org Thu May 24 01:39:12 2012
Josh Boyer 440886d
Return-path: <owner-linux-mm@kvack.org>
Josh Boyer 440886d
Envelope-to: glkm-linux-mm-2@m.gmane.org
Josh Boyer 440886d
Original-Received: from kanga.kvack.org ([205.233.56.17])
Josh Boyer 440886d
	by plane.gmane.org with esmtp (Exim 4.69)
Josh Boyer 440886d
	(envelope-from <owner-linux-mm@kvack.org>)
Josh Boyer 440886d
	id 1SXL94-0002ub-3P
Josh Boyer 440886d
	for glkm-linux-mm-2@m.gmane.org; Thu, 24 May 2012 01:39:10 +0200
Josh Boyer 440886d
Original-Received: by kanga.kvack.org (Postfix)
Josh Boyer 440886d
	id 1684A6B0083; Wed, 23 May 2012 19:39:09 -0400 (EDT)
Josh Boyer 440886d
Delivered-To: linux-mm-outgoing@kvack.org
Josh Boyer 440886d
Original-Received: by kanga.kvack.org (Postfix, from userid 40)
Josh Boyer 440886d
	id 080DD6B0092; Wed, 23 May 2012 19:39:08 -0400 (EDT)
Josh Boyer 440886d
X-Original-To: int-list-linux-mm@kvack.org
Josh Boyer 440886d
Delivered-To: int-list-linux-mm@kvack.org
Josh Boyer 440886d
Original-Received: by kanga.kvack.org (Postfix, from userid 63042)
Josh Boyer 440886d
	id C84046B00E7; Wed, 23 May 2012 19:39:08 -0400 (EDT)
Josh Boyer 440886d
X-Original-To: linux-mm@kvack.org
Josh Boyer 440886d
Delivered-To: linux-mm@kvack.org
Josh Boyer 440886d
Original-Received: from psmtp.com (na3sys010amx119.postini.com [74.125.245.119])
Josh Boyer 440886d
	by kanga.kvack.org (Postfix) with SMTP id 0B2DC6B0083
Josh Boyer 440886d
	for <linux-mm@kvack.org>; Wed, 23 May 2012 19:39:07 -0400 (EDT)
Josh Boyer 440886d
Original-Received: from mx1.redhat.com ([209.132.183.28]) (using TLSv1) by na3sys010amx119.postini.com ([74.125.244.10]) with SMTP;
Josh Boyer 440886d
	Wed, 23 May 2012 18:39:08 CDT
Josh Boyer 440886d
Original-Received: from int-mx12.intmail.prod.int.phx2.redhat.com (int-mx12.intmail.prod.int.phx2.redhat.com [10.5.11.25])
Josh Boyer 440886d
	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id q4NNd3dP002492
Josh Boyer 440886d
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
Josh Boyer 440886d
	Wed, 23 May 2012 19:39:03 -0400
Josh Boyer 440886d
Original-Received: from random.random (ovpn-113-72.phx2.redhat.com [10.3.113.72])
Josh Boyer 440886d
	by int-mx12.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id q4NNd1P7012233;
Josh Boyer 440886d
	Wed, 23 May 2012 19:39:02 -0400
Josh Boyer 440886d
In-Reply-To: <20120518230028.GF32479@redhat.com>
Josh Boyer 440886d
X-Scanned-By: MIMEDefang 2.68 on 10.5.11.25
Josh Boyer 440886d
X-pstn-neptune: 0/0/0.00/0
Josh Boyer 440886d
X-pstn-levels: (S:99.90000/99.90000 CV:99.9000 FC:95.5390 LC:95.5390 R:95.9108 P:95.9108 M:97.0282 C:98.6951 )
Josh Boyer 440886d
X-pstn-dkim: 0 skipped:not-enabled
Josh Boyer 440886d
X-pstn-settings: 3 (1.0000:1.0000) s cv gt3 gt2 gt1 r p m c 
Josh Boyer 440886d
X-pstn-addresses: from <aarcange@redhat.com> [db-null] 
Josh Boyer 440886d
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.2
Josh Boyer 440886d
Original-Sender: owner-linux-mm@kvack.org
Josh Boyer 440886d
Precedence: bulk
Josh Boyer 440886d
X-Loop: owner-majordomo@kvack.org
Josh Boyer 440886d
List-ID: <linux-mm.kvack.org>
Josh Boyer 440886d
Xref: news.gmane.org gmane.linux.kernel.mm:78936
Josh Boyer 440886d
Archived-At: <http://permalink.gmane.org/gmane.linux.kernel.mm/78936>
Josh Boyer 440886d
Josh Boyer 440886d
When holding the mmap_sem for reading, pmd_offset_map_lock should only
Josh Boyer 440886d
run on a pmd_t that has been read atomically from the pmdp
Josh Boyer 440886d
pointer, otherwise we may read only half of it leading to this crash.
Josh Boyer 440886d
Josh Boyer 440886d
PID: 11679  TASK: f06e8000  CPU: 3   COMMAND: "do_race_2_panic"
Josh Boyer 440886d
 #0 [f06a9dd8] crash_kexec at c049b5ec
Josh Boyer 440886d
 #1 [f06a9e2c] oops_end at c083d1c2
Josh Boyer 440886d
 #2 [f06a9e40] no_context at c0433ded
Josh Boyer 440886d
 #3 [f06a9e64] bad_area_nosemaphore at c043401a
Josh Boyer 440886d
 #4 [f06a9e6c] __do_page_fault at c0434493
Josh Boyer 440886d
 #5 [f06a9eec] do_page_fault at c083eb45
Josh Boyer 440886d
 #6 [f06a9f04] error_code (via page_fault) at c083c5d5
Josh Boyer 440886d
    EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
Josh Boyer 440886d
    00000000
Josh Boyer 440886d
    DS:  007b     ESI: 9e201000 ES:  007b     EDI: 01fb4700 GS:  00e0
Josh Boyer 440886d
    CS:  0060     EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
Josh Boyer 440886d
 #7 [f06a9f38] _spin_lock at c083bc14
Josh Boyer 440886d
 #8 [f06a9f44] sys_mincore at c0507b7d
Josh Boyer 440886d
 #9 [f06a9fb0] system_call at c083becd
Josh Boyer 440886d
                         start           len
Josh Boyer 440886d
    EAX: ffffffda  EBX: 9e200000  ECX: 00001000  EDX: 6228537f
Josh Boyer 440886d
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 003d0f00
Josh Boyer 440886d
    SS:  007b      ESP: 62285354  EBP: 62285388  GS:  0033
Josh Boyer 440886d
    CS:  0073      EIP: 00291416  ERR: 000000da  EFLAGS: 00000286
Josh Boyer 440886d
Josh Boyer 440886d
This should be a longstanding bug affecting x86 32bit PAE without
Josh Boyer 440886d
THP. Only archs with 64bit large pmd_t and 32bit unsigned long should
Josh Boyer 440886d
be affected.
Josh Boyer 440886d
Josh Boyer 440886d
With THP enabled the barrier() in
Josh Boyer 440886d
pmd_none_or_trans_huge_or_clear_bad() would partly hide the bug when
Josh Boyer 440886d
the pmd transition from none to stable, by forcing a re-read of the
Josh Boyer 440886d
*pmd in pmd_offset_map_lock, but when THP is enabled a new set of
Josh Boyer 440886d
problem arises by the fact could then transition freely in any of the
Josh Boyer 440886d
none, pmd_trans_huge or pmd_trans_stable states. So making the barrier
Josh Boyer 440886d
in pmd_none_or_trans_huge_or_clear_bad() unconditional isn't good idea
Josh Boyer 440886d
and it would be a flakey solution.
Josh Boyer 440886d
Josh Boyer 440886d
This should be fully fixed by introducing a pmd_read_atomic that reads
Josh Boyer 440886d
the pmd in order with THP disabled, or by reading the pmd atomically
Josh Boyer 440886d
with cmpxchg8b with THP enabled.
Josh Boyer 440886d
Josh Boyer 440886d
Luckily this new race condition only triggers in the places that must
Josh Boyer 440886d
already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
Josh Boyer 440886d
is localized there but this bug is not related to THP.
Josh Boyer 440886d
Josh Boyer 440886d
NOTE: this can trigger on x86 32bit systems with PAE enabled with more
Josh Boyer 440886d
than 4G of ram, otherwise the high part of the pmd will never risk to
Josh Boyer 440886d
be truncated because it would be zero at all times, in turn so hiding
Josh Boyer 440886d
the SMP race.
Josh Boyer 440886d
Josh Boyer 440886d
This bug was discovered and fully debugged by Ulrich, quote:
Josh Boyer 440886d
Josh Boyer 440886d
----
Josh Boyer 440886d
[..]
Josh Boyer 440886d
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
Josh Boyer 440886d
eax.
Josh Boyer 440886d
Josh Boyer 440886d
    496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
Josh Boyer 440886d
    *pmd)
Josh Boyer 440886d
    497 {
Josh Boyer 440886d
    498         /* depend on compiler for an atomic pmd read */
Josh Boyer 440886d
    499         pmd_t pmdval = *pmd;
Josh Boyer 440886d
Josh Boyer 440886d
                                // edi = pmd pointer
Josh Boyer 440886d
0xc0507a74 <sys_mincore+548>:   mov    0x8(%esp),%edi
Josh Boyer 440886d
...
Josh Boyer 440886d
                                // edx = PTE page table high address
Josh Boyer 440886d
0xc0507a84 <sys_mincore+564>:   mov    0x4(%edi),%edx
Josh Boyer 440886d
...
Josh Boyer 440886d
                                // eax = PTE page table low address
Josh Boyer 440886d
0xc0507a8e <sys_mincore+574>:   mov    (%edi),%eax
Josh Boyer 440886d
Josh Boyer 440886d
[..]
Josh Boyer 440886d
Josh Boyer 440886d
Please note that the PMD is not read atomically. These are two "mov"
Josh Boyer 440886d
instructions where the high order bits of the PMD entry are fetched
Josh Boyer 440886d
first. Hence, the above machine code is prone to the following race.
Josh Boyer 440886d
Josh Boyer 440886d
-  The PMD entry {high|low} is 0x0000000000000000.
Josh Boyer 440886d
   The "mov" at 0xc0507a84 loads 0x00000000 into edx.
Josh Boyer 440886d
Josh Boyer 440886d
-  A page fault (on another CPU) sneaks in between the two "mov"
Josh Boyer 440886d
   instructions and instantiates the PMD.
Josh Boyer 440886d
Josh Boyer 440886d
-  The PMD entry {high|low} is now 0x00000003fda38067.
Josh Boyer 440886d
   The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
Josh Boyer 440886d
----
Josh Boyer 440886d
Josh Boyer 440886d
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Josh Boyer 440886d
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Josh Boyer 440886d
---
Josh Boyer 440886d
 arch/x86/include/asm/pgtable-3level.h |   50 +++++++++++++++++++++++++++++++++
Josh Boyer 440886d
 include/asm-generic/pgtable.h         |   22 +++++++++++++-
Josh Boyer 440886d
 2 files changed, 70 insertions(+), 2 deletions(-)
Josh Boyer 440886d
Josh Boyer 440886d
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
Josh Boyer 440886d
index effff47..43876f1 100644
Josh Boyer 440886d
--- a/arch/x86/include/asm/pgtable-3level.h
Josh Boyer 440886d
+++ b/arch/x86/include/asm/pgtable-3level.h
Josh Boyer 440886d
@@ -31,6 +31,56 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
Josh Boyer 440886d
 	ptep->pte_low = pte.pte_low;
Josh Boyer 440886d
 }
Josh Boyer 440886d
 
Josh Boyer 440886d
+#define pmd_read_atomic pmd_read_atomic
Josh Boyer 440886d
+/*
Josh Boyer 440886d
+ * pte_offset_map_lock on 32bit PAE kernels was reading the pmd_t with
Josh Boyer 440886d
+ * a "*pmdp" dereference done by gcc. Problem is, in certain places
Josh Boyer 440886d
+ * where pte_offset_map_lock is called, concurrent page faults are
Josh Boyer 440886d
+ * allowed, if the mmap_sem is hold for reading. An example is mincore
Josh Boyer 440886d
+ * vs page faults vs MADV_DONTNEED. On the page fault side
Josh Boyer 440886d
+ * pmd_populate rightfully does a set_64bit, but if we're reading the
Josh Boyer 440886d
+ * pmd_t with a "*pmdp" on the mincore side, a SMP race can happen
Josh Boyer 440886d
+ * because gcc will not read the 64bit of the pmd atomically. To fix
Josh Boyer 440886d
+ * this all places running pmd_offset_map_lock() while holding the
Josh Boyer 440886d
+ * mmap_sem in read mode, shall read the pmdp pointer using this
Josh Boyer 440886d
+ * function to know if the pmd is null nor not, and in turn to know if
Josh Boyer 440886d
+ * they can run pmd_offset_map_lock or pmd_trans_huge or other pmd
Josh Boyer 440886d
+ * operations.
Josh Boyer 440886d
+ *
Josh Boyer 440886d
+ * Without THP if the mmap_sem is hold for reading, the
Josh Boyer 440886d
+ * pmd can only transition from null to not null while pmd_read_atomic runs.
Josh Boyer 440886d
+ * So there's no need of literally reading it atomically.
Josh Boyer 440886d
+ *
Josh Boyer 440886d
+ * With THP if the mmap_sem is hold for reading, the pmd can become
Josh Boyer 440886d
+ * THP or null or point to a pte (and in turn become "stable") at any
Josh Boyer 440886d
+ * time under pmd_read_atomic, so it's mandatory to read it atomically
Josh Boyer 440886d
+ * with cmpxchg8b.
Josh Boyer 440886d
+ */
Josh Boyer 440886d
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
Josh Boyer 440886d
+static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
Josh Boyer 440886d
+{
Josh Boyer 440886d
+	pmdval_t ret;
Josh Boyer 440886d
+	u32 *tmp = (u32 *)pmdp;
Josh Boyer 440886d
+
Josh Boyer 440886d
+	ret = (pmdval_t) (*tmp);
Josh Boyer 440886d
+	if (ret) {
Josh Boyer 440886d
+		/*
Josh Boyer 440886d
+		 * If the low part is null, we must not read the high part
Josh Boyer 440886d
+		 * or we can end up with a partial pmd.
Josh Boyer 440886d
+		 */
Josh Boyer 440886d
+		smp_rmb();
Josh Boyer 440886d
+		ret |= ((pmdval_t)*(tmp + 1)) << 32;
Josh Boyer 440886d
+	}
Josh Boyer 440886d
+
Josh Boyer 440886d
+	return (pmd_t) { ret };
Josh Boyer 440886d
+}
Josh Boyer 440886d
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
Josh Boyer 440886d
+static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
Josh Boyer 440886d
+{
Josh Boyer 440886d
+	return (pmd_t) { atomic64_read((atomic64_t *)pmdp) };
Josh Boyer 440886d
+}
Josh Boyer 440886d
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
Josh Boyer 440886d
+
Josh Boyer 440886d
 static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
Josh Boyer 440886d
 {
Josh Boyer 440886d
 	set_64bit((unsigned long long *)(ptep), native_pte_val(pte));
Josh Boyer 440886d
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
Josh Boyer 440886d
index 125c54e..fa596d9 100644
Josh Boyer 440886d
--- a/include/asm-generic/pgtable.h
Josh Boyer 440886d
+++ b/include/asm-generic/pgtable.h
Josh Boyer 440886d
@@ -446,6 +446,18 @@ static inline int pmd_write(pmd_t pmd)
Josh Boyer 440886d
 #endif /* __HAVE_ARCH_PMD_WRITE */
Josh Boyer 440886d
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
Josh Boyer 440886d
 
Josh Boyer 440886d
+#ifndef pmd_read_atomic
Josh Boyer 440886d
+static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
Josh Boyer 440886d
+{
Josh Boyer 440886d
+	/*
Josh Boyer 440886d
+	 * Depend on compiler for an atomic pmd read. NOTE: this is
Josh Boyer 440886d
+	 * only going to work, if the pmdval_t isn't larger than
Josh Boyer 440886d
+	 * an unsigned long.
Josh Boyer 440886d
+	 */
Josh Boyer 440886d
+	return *pmdp;
Josh Boyer 440886d
+}
Josh Boyer 440886d
+#endif
Josh Boyer 440886d
+
Josh Boyer 440886d
 /*
Josh Boyer 440886d
  * This function is meant to be used by sites walking pagetables with
Josh Boyer 440886d
  * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
Josh Boyer 440886d
@@ -459,11 +471,17 @@ static inline int pmd_write(pmd_t pmd)
Josh Boyer 440886d
  * undefined so behaving like if the pmd was none is safe (because it
Josh Boyer 440886d
  * can return none anyway). The compiler level barrier() is critically
Josh Boyer 440886d
  * important to compute the two checks atomically on the same pmdval.
Josh Boyer 440886d
+ *
Josh Boyer 440886d
+ * For 32bit kernels with a 64bit large pmd_t this automatically takes
Josh Boyer 440886d
+ * care of reading the pmd atomically to avoid SMP race conditions
Josh Boyer 440886d
+ * against pmd_populate() when the mmap_sem is hold for reading by the
Josh Boyer 440886d
+ * caller (a special atomic read not done by "gcc" as in the generic
Josh Boyer 440886d
+ * version above, is also needed when THP is disabled because the page
Josh Boyer 440886d
+ * fault can populate the pmd from under us).
Josh Boyer 440886d
  */
Josh Boyer 440886d
 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
Josh Boyer 440886d
 {
Josh Boyer 440886d
-	/* depend on compiler for an atomic pmd read */
Josh Boyer 440886d
-	pmd_t pmdval = *pmd;
Josh Boyer 440886d
+	pmd_t pmdval = pmd_read_atomic(pmd);
Josh Boyer 440886d
 	/*
Josh Boyer 440886d
 	 * The barrier will stabilize the pmdval in a register or on
Josh Boyer 440886d
 	 * the stack so that it will stop changing under the code.
Josh Boyer 440886d
Josh Boyer 440886d
--
Josh Boyer 440886d
To unsubscribe, send a message with 'unsubscribe linux-mm' in
Josh Boyer 440886d
the body to majordomo@kvack.org.  For more info on Linux MM,
Josh Boyer 440886d
see: http://www.linux-mm.org/ .
Josh Boyer 440886d
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Josh Boyer 440886d
Don't email:  email@kvack.org 
Josh Boyer 440886d