From 5ef3f04687a5fdca7ccf1a40c426d9db4498d2c7 Mon Sep 17 00:00:00 2001
From: Matt Fleming <matt@codeblueprint.co.uk>
Date: Thu, 7 May 2020 12:15:43 +0100
Subject: [PATCH] x86/asm/64: Align start of __clear_user() loop to 16-bytes
Patch-mainline: v5.8-rc3
Git-commit: bb5570ad3b54e7930997aec76ab68256d5236d94
References: bsc#1168461

x86 CPUs can suffer severe performance drops if a tight loop, such as
the ones in __clear_user(), straddles a 16-byte instruction fetch
window. This straddling seems to have been introduced in SLE15-SP2
with the following commit,

  1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")

which grew the loop's machine code from 10 bytes to 15 bytes and
caused the 8-byte copy loop in __clear_user() to be split across two
16-byte fetch windows.
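
For reference, one plausible breakdown of how the loop's encoding
grows from 10 bytes to 15 bytes; the registers shown are an
assumption (the old code let the compiler pick registers for the
constants), though the instruction lengths are not:

  Register constants (before 1153933703d9), 10 bytes:

    48 89 17                movq   %rdx,(%rdi)      # 3 bytes
    48 01 f7                addq   %rsi,%rdi        # 3 bytes
    ff c9                   decl   %ecx             # 2 bytes
    75 f6                   jnz    0b               # 2 bytes

  Immediate constants (after 1153933703d9), 15 bytes:

    48 c7 07 00 00 00 00    movq   $0,(%rdi)        # 7 bytes
    48 83 c7 08             addq   $8,%rdi          # 4 bytes
    ff c9                   decl   %ecx             # 2 bytes
    75 f1                   jnz    0b               # 2 bytes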

Aligning the start of the loop to 16 bytes makes it fit neatly
inside a single instruction fetch window again and restores the
performance of __clear_user(), which is used heavily when reading
from /dev/zero.
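
As a standalone illustration of the same technique (a minimal
sketch, not part of this patch; the function name, constraints and
loop structure are mine), forcing a tight zeroing loop onto a
16-byte boundary with GCC-style inline assembly on x86-64 looks
roughly like this:

  #include <stddef.h>

  /*
   * Sketch: zero 'qwords' 8-byte words at 'dst', with the loop
   * body padded to a 16-byte boundary so it fits in a single
   * instruction fetch window.
   */
  static void zero_qwords(void *dst, size_t qwords)
  {
  	asm volatile(
  		"	testq  %[cnt],%[cnt]\n"
  		"	jz     1f\n"
  		"	.align 16\n"	/* NOP-pad to a 16-byte boundary */
  		"0:	movq   $0,(%[dst])\n"
  		"	addq   $8,%[dst]\n"
  		"	decq   %[cnt]\n"
  		"	jnz    0b\n"
  		"1:\n"
  		: [dst] "+r" (dst), [cnt] "+r" (qwords)
  		:
  		: "memory");
  }

On x86 ELF targets .align takes a byte count, so .align 16 pads the
instruction stream with NOPs up to the next 16-byte boundary, which
is exactly what the one-line change below adds to __clear_user().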

Zen 1 (Naples)

libmicro-file
                                        4.12.14                 5.3.18                 5.3.18
                          default-gc5a4c91e56cd  default-g91efaf24bedf  align16-g91efaf24bedf
Time mean95-pread_z100k       5.8464 (   0.00%)      9.6102 ( -64.38%)      6.0099 (  -2.80%)
Time mean95-pread_z10k        0.7180 (   0.00%)      1.0994 ( -53.12%)      0.7476 (  -4.13%)
Time mean95-pread_z1k         0.2089 (   0.00%)      0.2568 ( -22.96%)      0.2255 (  -7.98%)
Time mean95-pread_zw100k      5.9265 (   0.00%)      9.6931 ( -63.55%)      6.0915 (  -2.78%)
Time mean95-read_z100k        5.8447 (   0.00%)      9.6114 ( -64.45%)      6.0107 (  -2.84%)
Time mean95-read_z10k         0.7173 (   0.00%)      1.1065 ( -54.26%)      0.7493 (  -4.46%)
Time mean95-read_z1k          0.2090 (   0.00%)      0.2514 ( -20.29%)      0.2275 (  -8.86%)
Time mean95-read_zw100k       5.9275 (   0.00%)      9.7080 ( -63.78%)      6.0951 (  -2.83%)

Zen 2 (Rome)

libmicro-file
                                        4.12.14                 5.3.18                 5.3.18
                          default-gc5a4c91e56cd  default-g91efaf24bedf  align16-g91efaf24bedf
Time mean95-pread_z100k       5.1452 (   0.00%)      8.2803 ( -60.93%)      5.1848 (  -0.77%)
Time mean95-pread_z10k        0.6274 (   0.00%)      0.9439 ( -50.45%)      0.6358 (  -1.34%)
Time mean95-pread_z1k         0.1771 (   0.00%)      0.2044 ( -15.44%)      0.1905 (  -7.60%)
Time mean95-pread_zw100k      5.2426 (   0.00%)      8.3323 ( -58.93%)      5.2394 (   0.06%)
Time mean95-read_z100k        5.2124 (   0.00%)      8.2830 ( -58.91%)      5.1860 (   0.51%)
Time mean95-read_z10k         0.6211 (   0.00%)      0.9455 ( -52.22%)      0.6378 (  -2.68%)
Time mean95-read_z1k          0.1770 (   0.00%)      0.2069 ( -16.84%)      0.1929 (  -8.97%)
Time mean95-read_zw100k       5.2022 (   0.00%)      8.3377 ( -60.27%)      5.2489 (  -0.90%)

Note that this doesn't affect the Haswell or Broadwell
microarchitectures, which can avoid the alignment issue by executing
the loop straight out of the Loop Stream Detector.

Fixes: 1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Matt Fleming <mfleming@suse.de>
---
 arch/x86/lib/usercopy_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index fff28c6f73a2..b0dfac3d3df7 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
 	asm volatile(
 		"	testq  %[size8],%[size8]\n"
 		"	jz     4f\n"
+		"	.align 16\n"
 		"0:	movq $0,(%[dst])\n"
 		"	addq   $8,%[dst]\n"
 		"	decl %%ecx ; jnz   0b\n"
-- 
2.16.4