From: Dexuan Cui <decui@microsoft.com>
Date: Thu, 14 Sep 2023 12:13:31 -0700
Subject: x86/hyperv: Add hv_write_efer() for a TDX VM with the paravisor
Patch-mainline: never, workaround for host bug
References: bsc#1206453

This is a temporary hack. The latest Hyper-V dev build has been
fixed, but the fix won't roll out to Azure until the end of 2023.

It's safe to have the hack on future fixed Hyper-V.

Note: the original workaround (see commit
78b17e1bd229 ("x86/hyperv: Add hv_write_efer() for a TDX VM with the paravisor"))
adds CPUHP_AP_HYPERV_FORCE_EFER_WRITE, which turns out to be unnecessary:

the first hypercall invoked on a non-boot VP is the IPI hypercall:

[    5.238850][    T1] smp: Bringing up secondary CPUs ...
[    5.241406][    T1] x86: Booting SMP configuration:
[    5.241818][    T1] .... node  #0, CPUs:          #1
[    5.243876][   T18] ------------[ cut here ]------------
[    5.245804][   T18] cdx: first hypercall
[    5.245804][   T18] WARNING: CPU: 1 PID: 18 at arch/x86/hyperv/hv_apic.c:243 __send_ipi_one+0x1b4/0x200
...
[    5.245804][   T18] CPU: 1 PID: 18 Comm: cpuhp/1 Not tainted 5.14.21-150400.22-default-decui-no-w-efer+ #6 SLE15-SP4 fefda0b177da1a9470efefee04adbe432f317fe6
[    5.245804][   T18] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 07/11/2023
[    5.245804][   T18] RIP: 0010:__send_ipi_one+0x1b4/0x200
...
[    5.245804][   T18] Call Trace:
[    5.245804][   T18]  <TASK>
[    5.245804][   T18]  hv_send_ipi+0x12/0x40
[    5.245804][   T18]  ttwu_queue_wakelist+0xef/0x110
[    5.245804][   T18]  try_to_wake_up+0x196/0x590
[    5.245804][   T18]  ? sched_cpu_activate+0xcd/0x180
[    5.245804][   T18]  swake_up_locked.part.0+0x13/0x30
[    5.245804][   T18]  complete+0x2f/0x40
[    5.245804][   T18]  cpuhp_thread_fun+0xb6/0x150
[    5.245804][   T18]  ? sort_range+0x20/0x20
[    5.245804][   T18]  smpboot_thread_fn+0xd8/0x1c0
[    5.245804][   T18]  kthread+0x15a/0x190
[    5.245804][   T18]  ? set_kthread_struct+0x50/0x50
[    5.245804][   T18]  ret_from_fork+0x22/0x30
[    5.245804][   T18]  </TASK>
[    5.245804][   T18] ---[ end trace 20694464dde8a516 ]---
...
[    5.245804][   T18] BUG: unable to handle page fault for address: 0000000000011003
[    5.245804][   T18] #PF: supervisor instruction fetch in kernel mode
[    5.245804][   T18] #PF: error_code(0x0010) - not-present page
[    5.245804][   T18] PGD 0 P4D 0
[    5.245804][   T18] Oops: 0010 [#1] PREEMPT SMP PTI
...
[    5.245804][   T18] RIP: 0010:0x11003
[    5.245804][   T18] Code: Unable to access opcode bytes at RIP 0x10fd9.
[    5.245804][   T18] RSP: 0000:ffffb997035e7db0 EFLAGS: 00010086
[    5.245804][   T18] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 000000000001000b
[    5.245804][   T18] RDX: 0000000000000000 RSI: 00000000fffbffff RDI: 0000000000000001
[    5.245804][   T18] RBP: 00000000000000fb R08: 0000000000000001 R09: ffffb997035e7be8
[    5.245804][   T18] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
[    5.245804][   T18] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    5.245804][   T18] FS:  0000000000000000(0000) GS:ffff9252fce40000(0000) knlGS:0000000000000000
[    5.245804][   T18] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.245804][   T18] CR2: 0000000000011003 CR3: 00000003b4628001 CR4: 0000000000370ee0
[    5.245804][   T18] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    5.245804][   T18] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[    5.245804][   T18] Call Trace:
[    5.245804][   T18]  <TASK>
[    5.245804][   T18]  ? __send_ipi_one+0xc7/0x200
[    5.245804][   T18]  ? hv_send_ipi+0x12/0x40
[    5.245804][   T18]  ? ttwu_queue_wakelist+0xef/0x110
[    5.245804][   T18]  ? try_to_wake_up+0x196/0x590
[    5.245804][   T18]  ? sched_cpu_activate+0xcd/0x180
[    5.245804][   T18]  ? swake_up_locked.part.0+0x13/0x30
[    5.245804][   T18]  ? complete+0x2f/0x40
[    5.245804][   T18]  ? cpuhp_thread_fun+0xb6/0x150
[    5.245804][   T18]  ? sort_range+0x20/0x20
[    5.245804][   T18]  ? smpboot_thread_fn+0xd8/0x1c0
[    5.245804][   T18]  ? kthread+0x15a/0x190
[    5.245804][   T18]  ? set_kthread_struct+0x50/0x50
[    5.245804][   T18]  ? ret_from_fork+0x22/0x30
[    5.245804][   T18]  </TASK>
[    5.245804][   T18] Modules linked in:
[    5.245804][   T18] Supported: Yes
[    5.245804][   T18] CR2: 0000000000011003
[    5.245804][   T18] ---[ end trace 20694464dde8a519 ]---
[    5.245804][   T18] RIP: 0010:0x11003
[    5.245804][   T18] Code: Unable to access opcode bytes at RIP 0x10fd9.

According to my testing, as long as hv_write_efer() runs before
CPUHP_AP_ONLINE, the fatal page fault doesn't happen.

CPUHP_AP_HYPERV_TIMER_STARTING happens before CPUHP_AP_ONLINE, so let's
move hv_write_efer() into hv_stimer_init() (refer to hv_stimer_alloc())
so that we can avoid adding CPUHP_AP_HYPERV_FORCE_EFER_WRITE.
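
For illustration only, the failure mode can be modeled in user space as
below. This is a rough sketch, not kernel or paravisor code: struct
vp_model, its paravisor_efer field and all three helpers are hypothetical
names, and the idea that the paravisor only learns EFER when the guest
writes it is an assumption inferred from the behavior described above.

```c
#include <stdint.h>

#define EFER_LME (1ULL << 8)	/* Long Mode Enable */
#define EFER_LMA (1ULL << 10)	/* Long Mode Active */

/* Hypothetical model of one VP. */
struct vp_model {
	uint64_t efer;			/* value the guest reads back */
	uint64_t paravisor_efer;	/* ASSUMPTION: refreshed only on a guest EFER write */
};

/* Mimics a boot path that skips the EFER write when nothing changed. */
static void boot_write_efer_if_changed(struct vp_model *vp, uint64_t wanted)
{
	if (vp->efer == wanted)
		return;			/* no write, so the paravisor sees nothing */
	vp->efer = wanted;
	vp->paravisor_efer = wanted;
}

/* Mimics the workaround: read EFER and write the same value back. */
static void force_efer_write(struct vp_model *vp)
{
	uint64_t efer = vp->efer;

	vp->efer = efer;
	vp->paravisor_efer = efer;
}

/* Return RIP as the modeled hypercall handler would compute it. */
static uint64_t hypercall_return_rip(const struct vp_model *vp, uint64_t rip)
{
	if ((vp->paravisor_efer & EFER_LMA) != 0)
		return rip;		/* 64-bit mode: RIP kept whole */
	return (uint64_t)(uint32_t)rip;	/* believed 32-bit mode: RIP truncated */
}
```

In this model, a VP whose EFER already equals the wanted value never
performs a write, so hypercall_return_rip() drops the upper 32 bits of the
return address until force_efer_write() has run once.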

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Acked-by: Olaf Hering <ohering@suse.de>
---
 drivers/clocksource/hyperv_timer.c | 33 ++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -123,6 +123,37 @@ static int hv_ce_set_oneshot(struct clock_event_device *evt)
 	return 0;
 }
 
+static int hv_write_efer(void)
+{
+	unsigned long long efer;
+
+	if (!hv_isolation_type_tdx() || !ms_hyperv.paravisor_present)
+		return 0;
+
+	/*
+	 * Write EFER by force, otherwise the paravisor's hypercall
+	 * handler thinks that the VP is in 32-bit mode, and the
+	 * returning RIP is truncated to 32 bits, causing a fatal
+	 * page fault. This is a TDX-specific issue because it looks
+	 * like the initial default value of EFER on non-boot VPs
+	 * already has the EFER.LMA bit, and when the reading of
+	 * EFER on a non-boot VP is the same as the value of EFER
+	 * on VP0, Linux doesn't write the EFER register on a
+	 * non-boot VP: see the code in arch/x86/kernel/head_64.S
+	 * ("Avoid writing EFER if no change was made (for TDX guest)").
+	 * Also see commit 77a512e35db7 ("x86/boot: Avoid #VE during boot for TDX platforms")
+	 * Work around the issue for now by forcing an EFER write.
+	 *
+	 * This is a temporary hack. The latest Hyper-V dev build has been
+	 * fixed, but the fix won't be ported to Azure until the end of 2023.
+	 * It's safe to have the hack on future fixed Hyper-V.
+	 */
+	rdmsrl(MSR_EFER, efer);
+	wrmsrl(MSR_EFER, efer);
+
+	return 0;
+}
+
 /*
  * hv_stimer_init - Per-cpu initialization of the clockevent
  */
@@ -130,6 +161,8 @@ static int hv_stimer_init(unsigned int cpu)
 {
 	struct clock_event_device *ce;
 
+	hv_write_efer();
+
 	if (!hv_clock_event)
 		return 0;