From 11752adb68a388724b1935d57bf543897c34d80b Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Tue, 7 Nov 2017 16:18:06 -0500
Subject: [PATCH] locking/pvqspinlock: Implement hybrid PV queued/unfair locks
Git-commit: 11752adb68a388724b1935d57bf543897c34d80b
Patch-mainline: v4.15-rc1
References: bsc#1050549

Currently, all the lock waiters entering the slowpath will make one
lock stealing attempt to acquire the lock. That helps performance,
especially in VMs with over-committed vCPUs. However, the current
pvqspinlocks still don't perform as well as unfair locks in many cases.
On the other hand, unfair locks do have the problem of lock starvation
that pvqspinlocks don't have.

This patch combines the best attributes of an unfair lock and a
pvqspinlock into a hybrid lock with two modes - queued mode and unfair
mode. A lock waiter goes into the unfair mode when there are waiters
in the wait queue but the pending bit isn't set. Otherwise, it will
go into the queued mode, waiting in the queue for its turn.

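As an illustration of that mode-selection logic, below is a minimal
userspace sketch using C11 atomics and a simplified lock-word layout
(a LOCKED byte, a PENDING bit and a TAIL field). The names, masks and
the whole-word compare-and-swap are assumptions made for the sketch;
the real implementation is pv_hybrid_queued_unfair_trylock() in the
diff below, which does the cmpxchg on just the locked byte.

  #include <assert.h>
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Illustrative lock-word layout: low byte = lock holder, bit 8 = the
   * pending bit, upper bits = tail of the MCS wait queue. */
  #define LOCKED_MASK   0x000000ffu
  #define PENDING_MASK  0x00000100u
  #define TAIL_MASK     0xffff0000u
  #define LOCKED_VAL    0x00000001u

  static bool hybrid_trylock(_Atomic uint32_t *lock)
  {
      for (;;) {
          uint32_t val = atomic_load_explicit(lock, memory_order_relaxed);
          uint32_t expected = val;

          /* Unfair mode: steal the lock while both the locked byte and
           * the pending bit are clear. */
          if (!(val & (LOCKED_MASK | PENDING_MASK)) &&
              atomic_compare_exchange_strong_explicit(lock, &expected,
                      val | LOCKED_VAL, memory_order_acquire,
                      memory_order_relaxed))
              return true;

          /* Queued mode: give up stealing when the wait queue is empty
           * (queue up and try to become its head instead) or when the
           * queue head has set the pending bit to claim its turn. */
          if (!(val & TAIL_MASK) || (val & PENDING_MASK))
              return false;

          /* Waiters are queued but the pending bit is clear: keep
           * spinning and retrying the steal. */
      }
  }

  int main(void)
  {
      _Atomic uint32_t lock = 0;

      assert(hybrid_trylock(&lock));   /* free lock: steal succeeds */
      assert(!hybrid_trylock(&lock));  /* held, queue empty: go queue */
      return 0;
  }

Unlike the kernel code, the sketch retries with a whole-word CAS,
which can fail spuriously when the tail changes concurrently; the
decision logic it demonstrates is the same.
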
On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
(make -j<n>, where <n> is the number of vCPUs available) was done in
a VM with unpinned vCPUs 3 times with the best time selected. The
build times of the original pvqspinlock, hybrid pvqspinlock and unfair
lock with various numbers of vCPUs are as follows:

  vCPUs    pvqlock     hybrid pvqlock    unfair lock
  -----    -------     --------------    -----------
    30      342.1s         329.1s          329.1s
    36      314.1s         305.3s          307.3s
    45      345.0s         302.1s          306.6s
    54      365.4s         308.6s          307.8s
    72      358.9s         293.6s          303.9s
   108      343.0s         285.9s          304.2s

The hybrid pvqspinlock performs better than or comparably to the
unfair lock.

With QUEUED_LOCK_STAT turned on, the table below shows the number
of lock acquisitions in unfair mode and queued mode after a kernel
build with various numbers of vCPUs.

  vCPUs    queued mode  unfair mode
  -----    -----------  -----------
    30      9,130,518      294,954
    36     10,856,614      386,809
    45      8,467,264   11,475,373
    54      6,409,987   19,670,855
    72      4,782,063   25,712,180

It can be seen that as the VM becomes more and more over-committed,
the ratio of locks acquired in unfair mode increases. This is all
done automatically to get the best possible overall performance.

Using a kernel locking microbenchmark with the number of locking
threads equal to the number of vCPUs available on the same machine,
the minimum, average and maximum (min/avg/max) numbers of locking
operations done per thread in a 5-second testing interval are shown
below:

  vCPUs         hybrid pvqlock             unfair lock
  -----         --------------             -----------
    36     822,135/881,063/950,363    75,570/313,496/  690,465
    54     542,435/581,664/625,937    35,460/204,280/  457,172
    72     397,500/428,177/499,299    17,933/150,679/  708,001
   108     257,898/288,150/340,871     3,085/181,176/1,257,109

It can be seen that the hybrid pvqspinlocks are both fairer and more
performant than the unfair locks in this test.

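The numbers above come from an in-kernel locking microbenchmark.
Purely as an illustration of this kind of harness, the userspace
sketch below spawns a fixed number of locking threads, lets each
count its lock/unlock operations on a shared pthread spinlock, and
stops them after a fixed interval. The thread count, the pthread
spinlock and the exit flag are assumptions of the sketch, not the
kernel test used above.

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>
  #include <unistd.h>

  #define NTHREADS 4            /* stand-in for "one thread per vCPU"   */
  #define DURATION 5            /* seconds, matching the interval above */

  static pthread_spinlock_t lock;
  static atomic_int stop;
  static unsigned long ops[NTHREADS];

  static void *worker(void *arg)
  {
      unsigned long n = 0;

      while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
          pthread_spin_lock(&lock);
          /* empty critical section: measure pure lock/unlock cost */
          pthread_spin_unlock(&lock);
          n++;
      }
      ops[(long)arg] = n;
      return NULL;
  }

  int main(void)
  {
      pthread_t tid[NTHREADS];
      long i;

      pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
      for (i = 0; i < NTHREADS; i++)
          pthread_create(&tid[i], NULL, worker, (void *)i);

      sleep(DURATION);
      atomic_store(&stop, 1);

      for (i = 0; i < NTHREADS; i++) {
          pthread_join(tid[i], NULL);
          printf("thread %ld: %lu lock ops\n", i, ops[i]);
      }
      return 0;
  }

Built with something like "gcc -O2 -pthread bench.c" (file name is
hypothetical), it prints a per-thread operation count comparable in
spirit to the per-thread numbers above.
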
The table below shows the kernel build times on a smaller 2-socket
16-core 32-thread E5-2620 v4 system.

  vCPUs    pvqlock     hybrid pvqlock    unfair lock
  -----    -------     --------------    -----------
    16      436.8s         433.4s          435.6s
    36      366.2s         364.8s          364.5s
    48      423.6s         376.3s          370.2s
    64      433.1s         376.6s          376.8s

Again, the performance of the hybrid pvqspinlock was comparable to
that of the unfair lock.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1510089486-3466-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>

---
 kernel/locking/qspinlock_paravirt.h | 47 ++++++++++++++++++++++++++++++-------
 1 file changed, 38 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 15b6a39366c6..6ee477765e6c 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -61,21 +61,50 @@ struct pv_node {
 #include "qspinlock_stat.h"
 
 /*
+ * Hybrid PV queued/unfair lock
+ *
  * By replacing the regular queued_spin_trylock() with the function below,
  * it will be called once when a lock waiter enter the PV slowpath before
- * being queued. By allowing one lock stealing attempt here when the pending
- * bit is off, it helps to reduce the performance impact of lock waiter
- * preemption without the drawback of lock starvation.
+ * being queued.
+ *
+ * The pending bit is set by the queue head vCPU of the MCS wait queue in
+ * pv_wait_head_or_lock() to signal that it is ready to spin on the lock.
+ * When that bit becomes visible to the incoming waiters, no lock stealing
+ * is allowed. The function will return immediately to make the waiters
+ * enter the MCS wait queue. So lock starvation shouldn't happen as long
+ * as the queued mode vCPUs are actively running to set the pending bit
+ * and hence disabling lock stealing.
+ *
+ * When the pending bit isn't set, the lock waiters will stay in the unfair
+ * mode spinning on the lock unless the MCS wait queue is empty. In this
+ * case, the lock waiters will enter the queued mode slowpath trying to
+ * become the queue head and set the pending bit.
+ *
+ * This hybrid PV queued/unfair lock combines the best attributes of a
+ * queued lock (no lock starvation) and an unfair lock (good performance
+ * on not heavily contended locks).
  */
-#define queued_spin_trylock(l)	pv_queued_spin_steal_lock(l)
-static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
+#define queued_spin_trylock(l)	pv_hybrid_queued_unfair_trylock(l)
+static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
 {
 	struct __qspinlock *l = (void *)lock;
 
-	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
-	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
-		qstat_inc(qstat_pv_lock_stealing, true);
-		return true;
+	/*
+	 * Stay in unfair lock mode as long as queued mode waiters are
+	 * present in the MCS wait queue but the pending bit isn't set.
+	 */
+	for (;;) {
+		int val = atomic_read(&lock->val);
+
+		if (!(val & _Q_LOCKED_PENDING_MASK) &&
+		   (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+			qstat_inc(qstat_pv_lock_stealing, true);
+			return true;
+		}
+		if (!(val & _Q_TAIL_MASK) || (val & _Q_PENDING_MASK))
+			break;
+
+		cpu_relax();
 	}
 
 	return false;
-- 
2.13.6