path: root/meta/recipes-devtools/gcc/gcc/0003-aarch64-Mitigate-SLS-for-BLR-instruction.patch
Upstream-Status: Backport
Signed-off-by: Ross Burton <ross.burton@arm.com>

From a5e7efc40ed841934c1d913f39476afa17d8e5f7 Mon Sep 17 00:00:00 2001
From: Matthew Malcomson <matthew.malcomson@arm.com>
Date: Thu, 9 Jul 2020 09:11:59 +0100
Subject: [PATCH 3/3] aarch64: Mitigate SLS for BLR instruction

This patch introduces the mitigation for Straight Line Speculation past
the BLR instruction.

This mitigation replaces BLR instructions with a BL to a stub which uses
a BR to jump to the original value.  These function stubs are then
appended with a speculation barrier to ensure no straight line
speculation happens after these jumps.

When optimising for speed we use a set of stubs for each function since
this should help the branch predictor make more accurate predictions
about where a stub should branch.

When optimising for size we use one set of stubs for all functions.
This set of stubs can have human-readable names, and we are using
`__call_indirect_x<N>` for register x<N>.

When BTI branch protection is enabled the BLR instruction can jump to a
`BTI c` instruction using any register, while the BR instruction can
only jump to a `BTI c` instruction using the x16 or x17 registers.
Hence, in order to ensure this transformation is safe we mov the value
of the original register into x16 and use x16 for the BR.

As an example when optimising for size:
a
    BLR x0
instruction would get transformed to something like
    BL __call_indirect_x0
where __call_indirect_x0 labels a thunk that contains
__call_indirect_x0:
    MOV X16, X0
    BR X16
    <speculation barrier>

The first version of this patch used local symbols specific to a
compilation unit to try and avoid relocations.
This was mistaken since functions coming from the same compilation unit
can still be in different sections, and the assembler will insert
relocations at jumps between sections.

On any relocation the linker is permitted to emit a veneer to handle
jumps between symbols that are very far apart.  The registers x16 and
x17 may be clobbered by these veneers.
Hence the function stubs cannot rely on the values of x16 and x17 being
the same as just before the function stub is called.
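
As an illustration of why this matters: a long-branch veneer inserted by the
linker is free to use x16 and x17 (the intra-procedure-call scratch
registers), so it may look something like
    ADRP X16, <target>
    ADD  X16, X16, :lo12:<target>
    BR   X16
(the exact sequence is up to the linker -- this is only a sketch).  Any BL
that might be routed through such a veneer therefore cannot assume that x16
or x17 survive the call.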

The same can be said for the hot/cold partitioning of single functions, so
function-local stubs have the same restriction.

This updated version of the patch never emits function stubs for x16 and
x17, and instead forces other registers to be used.

Given the above, there is now no benefit to local symbols (since they
are not enough to avoid dealing with linker intricacies).  This patch
now uses global symbols with hidden visibility each stored in their own
COMDAT section.  This means stubs can be shared between compilation
units while still avoiding the PLT indirection.

This patch also removes the `__call_indirect_x30` stub (and
function-local equivalent) which would simply jump back to the original
location.

The function-local stubs are emitted to the assembly output file in one
chunk, which means we need not add the speculation barrier directly
after each one.
This is because we know for certain that the instructions directly after
the BR in all but the last function stub will be from another one of
these stubs and hence will not contain a speculation gadget.
Instead we add a speculation barrier at the end of the sequence of
stubs.
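
For illustration, if a function (compiled when optimising for speed) made
indirect calls through x0 and x2, the chunk appended after its epilogue
would look something like the following (the label names are placeholders;
the compiler emits internal labels):
.L_call_x0:
    MOV X16, X0
    BR X16
.L_call_x2:
    MOV X16, X2
    BR X16
    <speculation barrier>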

The global stubs are emitted in COMDAT/.linkonce sections by
themselves so that the linker can remove duplicates from multiple object
files.  This means they are not emitted in one chunk, and each one must
include the speculation barrier.

Another difference is that since the global stubs are shared across
compilation units we do not know that all functions will be targeting an
architecture supporting the SB instruction.
Rather than provide multiple stubs for each architecture, we provide a
stub that will work for all architectures -- using the DSB+ISB barrier.
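
Putting this together, the shared stub for x0 is along the lines of the
following (the exact section name and directives vary with the object
format, so treat them as a sketch):
    .section .text.__call_indirect_x0,"axG",@progbits,__call_indirect_x0,comdat
    .hidden __call_indirect_x0
    .global __call_indirect_x0
    .type   __call_indirect_x0, %function
__call_indirect_x0:
    MOV X16, X0
    BR X16
    DSB SY
    ISB
    .size __call_indirect_x0, .-__call_indirect_x0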

This mitigation does not apply for BLR instructions in the following
places:
- Some accesses to thread-local variables use a code sequence with a BLR
  instruction.  This code sequence is part of the binary interface between
  compiler and linker. If this BLR instruction needs to be mitigated, it'd
  probably be best to do so in the linker. It seems that the code sequence
  for thread-local variable access is unlikely to lead to a Spectre Revelation
  Gadget.
- PLT stubs are produced by the linker and each contains a BLR instruction.
  It seems that a Spectre Revelation Gadget might appear, at most, only after
  the last PLT stub.

Testing:
  Bootstrap and regtest on AArch64
    (with BOOT_CFLAGS="-mharden-sls=retbr,blr")
  Used a temporary hack(1) in gcc-dg.exp to use these options on every
  test in the testsuite, a slight modification to emit the speculation
  barrier after every function stub, and a script to check that the
  output never emitted a BLR, or unmitigated BR or RET instruction.
  Similar on an aarch64-none-elf cross-compiler.

1) The temporary hack emitted a speculation barrier at the end of every stub
function, and used a script to ensure that:
  a) Every RET or BR is immediately followed by a speculation barrier.
  b) No BLR instruction is emitted by the compiler.

gcc/ChangeLog:

	* config/aarch64/aarch64-protos.h (aarch64_indirect_call_asm):
	New declaration.
	* config/aarch64/aarch64.c (aarch64_regno_regclass): Handle new
	stub registers class.
	(aarch64_class_max_nregs): Likewise.
	(aarch64_register_move_cost): Likewise.
	(aarch64_sls_shared_thunks): Global array to store stub labels.
	(aarch64_sls_emit_function_stub): New.
	(aarch64_sls_create_blr_label): New.
	(aarch64_sls_emit_blr_function_thunks): New.
	(aarch64_sls_emit_shared_blr_thunks): New.
	(aarch64_asm_file_end): New.
	(aarch64_indirect_call_asm): New.
	(TARGET_ASM_FILE_END): Use aarch64_asm_file_end.
	(TARGET_ASM_FUNCTION_EPILOGUE): Use
	aarch64_sls_emit_blr_function_thunks.
	* config/aarch64/aarch64.h (STUB_REGNUM_P): New.
	(enum reg_class): Add STUB_REGS class.
	(machine_function): Introduce `call_via` array for
	function-local stub labels.
	* config/aarch64/aarch64.md (*call_insn, *call_value_insn): Use
	aarch64_indirect_call_asm to emit code when hardening BLR
	instructions.
	* config/aarch64/constraints.md (Ucr): New constraint
	representing registers for indirect calls.  Is GENERAL_REGS
	usually, and STUB_REGS when hardening BLR instruction against
	SLS.
	* config/aarch64/predicates.md (aarch64_general_reg): STUB_REGS class
	is also a general register.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sls-mitigation/sls-miti-blr-bti.c: New test.
	* gcc.target/aarch64/sls-mitigation/sls-miti-blr.c: New test.
---
 gcc/config/aarch64/aarch64-protos.h                |   1 +
 gcc/config/aarch64/aarch64.c                       | 225 ++++++++++++++++++++-
 gcc/config/aarch64/aarch64.h                       |  15 ++
 gcc/config/aarch64/aarch64.md                      |  11 +-
 gcc/config/aarch64/constraints.md                  |   9 +
 gcc/config/aarch64/predicates.md                   |   3 +-
 .../aarch64/sls-mitigation/sls-miti-blr-bti.c      |  40 ++++
 .../aarch64/sls-mitigation/sls-miti-blr.c          |  33 +++
 8 files changed, 328 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sls-mitigation/sls-miti-blr-bti.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sls-mitigation/sls-miti-blr.c

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index ee0ffde..839f801 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -782,6 +782,7 @@ extern const atomic_ool_names aarch64_ool_ldeor_names;
 tree aarch64_resolve_overloaded_builtin_general (location_t, tree, void *);
 
 const char *aarch64_sls_barrier (int);
+const char *aarch64_indirect_call_asm (rtx);
 extern bool aarch64_harden_sls_retbr_p (void);
 extern bool aarch64_harden_sls_blr_p (void);
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 2389d49..0f7bba3 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -10605,6 +10605,9 @@ aarch64_label_mentioned_p (rtx x)
 enum reg_class
 aarch64_regno_regclass (unsigned regno)
 {
+  if (STUB_REGNUM_P (regno))
+    return STUB_REGS;
+
   if (GP_REGNUM_P (regno))
     return GENERAL_REGS;
 
@@ -10939,6 +10942,7 @@ aarch64_class_max_nregs (reg_class_t regclass, machine_mode mode)
   unsigned int nregs, vec_flags;
   switch (regclass)
     {
+    case STUB_REGS:
     case TAILCALL_ADDR_REGS:
     case POINTER_REGS:
     case GENERAL_REGS:
@@ -13155,10 +13159,12 @@ aarch64_register_move_cost (machine_mode mode,
     = aarch64_tune_params.regmove_cost;
 
   /* Caller save and pointer regs are equivalent to GENERAL_REGS.  */
-  if (to == TAILCALL_ADDR_REGS || to == POINTER_REGS)
+  if (to == TAILCALL_ADDR_REGS || to == POINTER_REGS
+      || to == STUB_REGS)
     to = GENERAL_REGS;
 
-  if (from == TAILCALL_ADDR_REGS || from == POINTER_REGS)
+  if (from == TAILCALL_ADDR_REGS || from == POINTER_REGS
+      || from == STUB_REGS)
     from = GENERAL_REGS;
 
   /* Make RDFFR very expensive.  In particular, if we know that the FFR
@@ -22957,6 +22963,215 @@ aarch64_sls_barrier (int mitigation_required)
     : "";
 }
 
+static GTY (()) tree aarch64_sls_shared_thunks[30];
+static GTY (()) bool aarch64_sls_shared_thunks_needed = false;
+const char *indirect_symbol_names[30] = {
+    "__call_indirect_x0",
+    "__call_indirect_x1",
+    "__call_indirect_x2",
+    "__call_indirect_x3",
+    "__call_indirect_x4",
+    "__call_indirect_x5",
+    "__call_indirect_x6",
+    "__call_indirect_x7",
+    "__call_indirect_x8",
+    "__call_indirect_x9",
+    "__call_indirect_x10",
+    "__call_indirect_x11",
+    "__call_indirect_x12",
+    "__call_indirect_x13",
+    "__call_indirect_x14",
+    "__call_indirect_x15",
+    "", /* "__call_indirect_x16",  */
+    "", /* "__call_indirect_x17",  */
+    "__call_indirect_x18",
+    "__call_indirect_x19",
+    "__call_indirect_x20",
+    "__call_indirect_x21",
+    "__call_indirect_x22",
+    "__call_indirect_x23",
+    "__call_indirect_x24",
+    "__call_indirect_x25",
+    "__call_indirect_x26",
+    "__call_indirect_x27",
+    "__call_indirect_x28",
+    "__call_indirect_x29",
+};
+
+/* Function to create a BLR thunk.  This thunk is used to mitigate straight
+   line speculation.  Instead of a simple BLR that can be speculated past,
+   we emit a BL to this thunk, and this thunk contains a BR to the relevant
+   register.  These thunks have the relevant speculation barriers put after
+   their indirect branch so that speculation is blocked.
+
+   We use such a thunk so the speculation barriers are kept off the
+   architecturally executed path in order to reduce the performance overhead.
+
+   When optimizing for size we use stubs shared by the linked object.
+   When optimizing for performance we emit stubs for each function in the hope
+   that the branch predictor can better train on jumps specific for a given
+   function.  */
+rtx
+aarch64_sls_create_blr_label (int regnum)
+{
+  gcc_assert (STUB_REGNUM_P (regnum));
+  if (optimize_function_for_size_p (cfun))
+    {
+      /* For the thunks shared between different functions in this compilation
+	 unit we use a named symbol -- this is just for users to more easily
+	 understand the generated assembly.  */
+      aarch64_sls_shared_thunks_needed = true;
+      const char *thunk_name = indirect_symbol_names[regnum];
+      if (aarch64_sls_shared_thunks[regnum] == NULL)
+	{
+	  /* Build a decl representing this function stub and record it for
+	     later.  We build a decl here so we can use the GCC machinery for
+	     handling sections automatically (through `get_named_section` and
+	     `make_decl_one_only`).  That saves us a lot of trouble handling
+	     the specifics of different output file formats.  */
+	  tree decl = build_decl (BUILTINS_LOCATION, FUNCTION_DECL,
+				  get_identifier (thunk_name),
+				  build_function_type_list (void_type_node,
+							    NULL_TREE));
+	  DECL_RESULT (decl) = build_decl (BUILTINS_LOCATION, RESULT_DECL,
+					   NULL_TREE, void_type_node);
+	  TREE_PUBLIC (decl) = 1;
+	  TREE_STATIC (decl) = 1;
+	  DECL_IGNORED_P (decl) = 1;
+	  DECL_ARTIFICIAL (decl) = 1;
+	  make_decl_one_only (decl, DECL_ASSEMBLER_NAME (decl));
+	  resolve_unique_section (decl, 0, false);
+	  aarch64_sls_shared_thunks[regnum] = decl;
+	}
+
+      return gen_rtx_SYMBOL_REF (Pmode, thunk_name);
+    }
+
+  if (cfun->machine->call_via[regnum] == NULL)
+    cfun->machine->call_via[regnum]
+      = gen_rtx_LABEL_REF (Pmode, gen_label_rtx ());
+  return cfun->machine->call_via[regnum];
+}
+
+/* Helper function for aarch64_sls_emit_blr_function_thunks and
+   aarch64_sls_emit_shared_blr_thunks below.  */
+static void
+aarch64_sls_emit_function_stub (FILE *out_file, int regnum)
+{
+  /* Save in x16 and branch to that function so this transformation does
+     not prevent jumping to `BTI c` instructions.  */
+  asm_fprintf (out_file, "\tmov\tx16, x%d\n", regnum);
+  asm_fprintf (out_file, "\tbr\tx16\n");
+}
+
+/* Emit all BLR stubs for this particular function.
+   Here we emit all the BLR stubs needed for the current function.  Since we
+   emit these stubs in a consecutive block we know there will be no speculation
+   gadgets between each stub, and hence we only emit a speculation barrier at
+   the end of the stub sequences.
+
+   This is called in the TARGET_ASM_FUNCTION_EPILOGUE hook.  */
+void
+aarch64_sls_emit_blr_function_thunks (FILE *out_file)
+{
+  if (! aarch64_harden_sls_blr_p ())
+    return;
+
+  bool any_functions_emitted = false;
+  /* We must save and restore the current function section since this assembly
+     is emitted at the end of the function.  This means it can be emitted *just
+     after* the cold section of a function.  That cold part would be emitted in
+     a different section.  That switch would trigger a `.cfi_endproc` directive
+     to be emitted in the original section and a `.cfi_startproc` directive to
+     be emitted in the new section.  Switching to the original section without
+     restoring would mean that the `.cfi_endproc` emitted as a function ends
+     would happen in a different section -- leaving an unmatched
+     `.cfi_startproc` in the cold text section and an unmatched `.cfi_endproc`
+     in the standard text section.  */
+  section *save_text_section = in_section;
+  switch_to_section (function_section (current_function_decl));
+  for (int regnum = 0; regnum < 30; ++regnum)
+    {
+      rtx specu_label = cfun->machine->call_via[regnum];
+      if (specu_label == NULL)
+	continue;
+
+      targetm.asm_out.print_operand (out_file, specu_label, 0);
+      asm_fprintf (out_file, ":\n");
+      aarch64_sls_emit_function_stub (out_file, regnum);
+      any_functions_emitted = true;
+    }
+  if (any_functions_emitted)
+    /* Can use the SB if needs be here, since this stub will only be used
+      by the current function, and hence for the current target.  */
+    asm_fprintf (out_file, "\t%s\n", aarch64_sls_barrier (true));
+  switch_to_section (save_text_section);
+}
+
+/* Emit shared BLR stubs for the current compilation unit.
+   Over the course of compiling this unit we may have converted some BLR
+   instructions to a BL to a shared stub function.  This is where we emit those
+   stub functions.
+   This function is for the stubs shared between different functions in this
+   compilation unit.  We share when optimizing for size instead of speed.
+
+   This function is called through the TARGET_ASM_FILE_END hook.  */
+void
+aarch64_sls_emit_shared_blr_thunks (FILE *out_file)
+{
+  if (! aarch64_sls_shared_thunks_needed)
+    return;
+
+  for (int regnum = 0; regnum < 30; ++regnum)
+    {
+      tree decl = aarch64_sls_shared_thunks[regnum];
+      if (!decl)
+	continue;
+
+      const char *name = indirect_symbol_names[regnum];
+      switch_to_section (get_named_section (decl, NULL, 0));
+      ASM_OUTPUT_ALIGN (out_file, 2);
+      targetm.asm_out.globalize_label (out_file, name);
+      /* Only emits if the compiler is configured for an assembler that can
+	 handle visibility directives.  */
+      targetm.asm_out.assemble_visibility (decl, VISIBILITY_HIDDEN);
+      ASM_OUTPUT_TYPE_DIRECTIVE (out_file, name, "function");
+      ASM_OUTPUT_LABEL (out_file, name);
+      aarch64_sls_emit_function_stub (out_file, regnum);
+      /* Use the most conservative target to ensure it can always be used by any
+	 function in the translation unit.  */
+      asm_fprintf (out_file, "\tdsb\tsy\n\tisb\n");
+      ASM_DECLARE_FUNCTION_SIZE (out_file, name, decl);
+    }
+}
+
+/* Implement TARGET_ASM_FILE_END.  */
+void
+aarch64_asm_file_end ()
+{
+  aarch64_sls_emit_shared_blr_thunks (asm_out_file);
+  /* Since this function will be called for the ASM_FILE_END hook, we ensure
+     that what would be called otherwise (e.g. `file_end_indicate_exec_stack`
+     for FreeBSD) still gets called.  */
+#ifdef TARGET_ASM_FILE_END
+  TARGET_ASM_FILE_END ();
+#endif
+}
+
+const char *
+aarch64_indirect_call_asm (rtx addr)
+{
+  gcc_assert (REG_P (addr));
+  if (aarch64_harden_sls_blr_p ())
+    {
+      rtx stub_label = aarch64_sls_create_blr_label (REGNO (addr));
+      output_asm_insn ("bl\t%0", &stub_label);
+    }
+  else
+   output_asm_insn ("blr\t%0", &addr);
+  return "";
+}
+
 /* Target-specific selftests.  */
 
 #if CHECKING_P
@@ -23507,6 +23722,12 @@ aarch64_libgcc_floating_mode_supported_p
 #undef TARGET_MD_ASM_ADJUST
 #define TARGET_MD_ASM_ADJUST arm_md_asm_adjust
 
+#undef TARGET_ASM_FILE_END
+#define TARGET_ASM_FILE_END aarch64_asm_file_end
+
+#undef TARGET_ASM_FUNCTION_EPILOGUE
+#define TARGET_ASM_FUNCTION_EPILOGUE aarch64_sls_emit_blr_function_thunks
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-aarch64.h"
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 8e0fc37..7331450 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -643,6 +643,16 @@ extern unsigned aarch64_architecture_version;
 #define GP_REGNUM_P(REGNO)						\
   (((unsigned) (REGNO - R0_REGNUM)) <= (R30_REGNUM - R0_REGNUM))
 
+/* Registers known to be preserved over a BL instruction.  This consists of the
+   GENERAL_REGS without x16, x17, and x30.  The x30 register is changed by the
+   BL instruction itself, while the x16 and x17 registers may be used by
+   veneers which can be inserted by the linker.  */
+#define STUB_REGNUM_P(REGNO) \
+  (GP_REGNUM_P (REGNO) \
+   && (REGNO) != R16_REGNUM \
+   && (REGNO) != R17_REGNUM \
+   && (REGNO) != R30_REGNUM) \
+
 #define FP_REGNUM_P(REGNO)			\
   (((unsigned) (REGNO - V0_REGNUM)) <= (V31_REGNUM - V0_REGNUM))
 
@@ -667,6 +677,7 @@ enum reg_class
 {
   NO_REGS,
   TAILCALL_ADDR_REGS,
+  STUB_REGS,
   GENERAL_REGS,
   STACK_REG,
   POINTER_REGS,
@@ -689,6 +700,7 @@ enum reg_class
 {						\
   "NO_REGS",					\
   "TAILCALL_ADDR_REGS",				\
+  "STUB_REGS",					\
   "GENERAL_REGS",				\
   "STACK_REG",					\
   "POINTER_REGS",				\
@@ -708,6 +720,7 @@ enum reg_class
 {									\
   { 0x00000000, 0x00000000, 0x00000000 },	/* NO_REGS */		\
   { 0x00030000, 0x00000000, 0x00000000 },	/* TAILCALL_ADDR_REGS */\
+  { 0x3ffcffff, 0x00000000, 0x00000000 },	/* STUB_REGS */		\
   { 0x7fffffff, 0x00000000, 0x00000003 },	/* GENERAL_REGS */	\
   { 0x80000000, 0x00000000, 0x00000000 },	/* STACK_REG */		\
   { 0xffffffff, 0x00000000, 0x00000003 },	/* POINTER_REGS */	\
@@ -862,6 +875,8 @@ typedef struct GTY (()) machine_function
   struct aarch64_frame frame;
   /* One entry for each hard register.  */
   bool reg_is_wrapped_separately[LAST_SAVED_REGNUM];
+  /* One entry for each general purpose register.  */
+  rtx call_via[SP_REGNUM];
   bool label_is_assembled;
 } machine_function;
 #endif
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index dda04ee..43da754 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1022,16 +1022,15 @@
 )
 
 (define_insn "*call_insn"
-  [(call (mem:DI (match_operand:DI 0 "aarch64_call_insn_operand" "r, Usf"))
+  [(call (mem:DI (match_operand:DI 0 "aarch64_call_insn_operand" "Ucr, Usf"))
 	 (match_operand 1 "" ""))
    (unspec:DI [(match_operand:DI 2 "const_int_operand")] UNSPEC_CALLEE_ABI)
    (clobber (reg:DI LR_REGNUM))]
   ""
   "@
-  blr\\t%0
+  * return aarch64_indirect_call_asm (operands[0]);
   bl\\t%c0"
-  [(set_attr "type" "call, call")]
-)
+  [(set_attr "type" "call, call")])
 
 (define_expand "call_value"
   [(parallel
@@ -1050,13 +1049,13 @@
 
 (define_insn "*call_value_insn"
   [(set (match_operand 0 "" "")
-	(call (mem:DI (match_operand:DI 1 "aarch64_call_insn_operand" "r, Usf"))
+	(call (mem:DI (match_operand:DI 1 "aarch64_call_insn_operand" "Ucr, Usf"))
 		      (match_operand 2 "" "")))
    (unspec:DI [(match_operand:DI 3 "const_int_operand")] UNSPEC_CALLEE_ABI)
    (clobber (reg:DI LR_REGNUM))]
   ""
   "@
-  blr\\t%1
+  * return aarch64_indirect_call_asm (operands[1]);
   bl\\t%c1"
   [(set_attr "type" "call, call")]
 )
diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md
index d993268..8cc6f50 100644
--- a/gcc/config/aarch64/constraints.md
+++ b/gcc/config/aarch64/constraints.md
@@ -24,6 +24,15 @@
 (define_register_constraint "Ucs" "TAILCALL_ADDR_REGS"
   "@internal Registers suitable for an indirect tail call")
 
+(define_register_constraint "Ucr"
+    "aarch64_harden_sls_blr_p () ? STUB_REGS : GENERAL_REGS"
+  "@internal Registers to be used for an indirect call.
+   This is usually the general registers, but when we are hardening against
+   Straight Line Speculation we disallow x16, x17, and x30 so we can use
+   indirection stubs.  These indirection stubs cannot use the above registers
+   since they will be reached by a BL that may have to go through a linker
+   veneer.")
+
 (define_register_constraint "w" "FP_REGS"
   "Floating point and SIMD vector registers.")
 
diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
index 215fcec..1754b1e 100644
--- a/gcc/config/aarch64/predicates.md
+++ b/gcc/config/aarch64/predicates.md
@@ -32,7 +32,8 @@
 
 (define_predicate "aarch64_general_reg"
   (and (match_operand 0 "register_operand")
-       (match_test "REGNO_REG_CLASS (REGNO (op)) == GENERAL_REGS")))
+       (match_test "REGNO_REG_CLASS (REGNO (op)) == STUB_REGS
+		    || REGNO_REG_CLASS (REGNO (op)) == GENERAL_REGS")))
 
 ;; Return true if OP a (const_int 0) operand.
 (define_predicate "const0_operand"
diff --git a/gcc/testsuite/gcc.target/aarch64/sls-mitigation/sls-miti-blr-bti.c b/gcc/testsuite/gcc.target/aarch64/sls-mitigation/sls-miti-blr-bti.c
new file mode 100644
index 0000000..b1fb754
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sls-mitigation/sls-miti-blr-bti.c
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-mharden-sls=blr -mbranch-protection=bti" } */
+/*
+   Ensure that the SLS hardening of BLR leaves no BLR instructions.
+   Here we also check that there are no BR instructions with anything except an
+   x16 or x17 register.  This is because a `BTI c` instruction can be branched
+   to using a BLR instruction using any register, but can only be branched to
+   with a BR using an x16 or x17 register.
+  */
+typedef int (foo) (int, int);
+typedef void (bar) (int, int);
+struct sls_testclass {
+    foo *x;
+    bar *y;
+    int left;
+    int right;
+};
+
+/* We test both RTL patterns for a call which returns a value and a call which
+   does not.  */
+int blr_call_value (struct sls_testclass x)
+{
+  int retval = x.x(x.left, x.right);
+  if (retval % 10)
+    return 100;
+  return 9;
+}
+
+int blr_call (struct sls_testclass x)
+{
+  x.y(x.left, x.right);
+  if (x.left % 10)
+    return 100;
+  return 9;
+}
+
+/* { dg-final { scan-assembler-not {\tblr\t} } } */
+/* { dg-final { scan-assembler-not {\tbr\tx(?!16|17)} } } */
+/* { dg-final { scan-assembler {\tbr\tx(16|17)} } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/sls-mitigation/sls-miti-blr.c b/gcc/testsuite/gcc.target/aarch64/sls-mitigation/sls-miti-blr.c
new file mode 100644
index 0000000..88bafff
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sls-mitigation/sls-miti-blr.c
@@ -0,0 +1,33 @@
+/* { dg-additional-options "-mharden-sls=blr -save-temps" } */
+/* Ensure that the SLS hardening of BLR leaves no BLR instructions.
+   We only test that all BLR instructions have been removed, not that the
+   resulting code makes sense.  */
+typedef int (foo) (int, int);
+typedef void (bar) (int, int);
+struct sls_testclass {
+    foo *x;
+    bar *y;
+    int left;
+    int right;
+};
+
+/* We test both RTL patterns for a call which returns a value and a call which
+   does not.  */
+int blr_call_value (struct sls_testclass x)
+{
+  int retval = x.x(x.left, x.right);
+  if (retval % 10)
+    return 100;
+  return 9;
+}
+
+int blr_call (struct sls_testclass x)
+{
+  x.y(x.left, x.right);
+  if (x.left % 10)
+    return 100;
+  return 9;
+}
+
+/* { dg-final { scan-assembler-not {\tblr\t} } } */
+/* { dg-final { scan-assembler {\tbr\tx[0-9][0-9]?} } } */
-- 
2.7.4