source: branches/arm/lisp-kernel/ARM-notes.txt @ 13923

Last change on this file since 13923 was 13748, checked in by gb, 9 years ago

Complain about the lack of "trapped FP exceptions" on VFPv3.

File size: 13.4 KB
Some notes on a CCL ARM port.

- floating point

Recent ARM architecture variants offer a mostly reasonable FPU (the
"Vector Floating Point" unit); older variants either don't have
hardware FP or have somewhat bizarre ones.  The current Linux ABI
("gnueabi") tries to support all configurations by specifying that
FP arithmetic be done via calls to support routines that take their
arguments in and return results in GPRs.  For CCL, that'd amount to
doing all FP arithmetic via the FFI, which doesn't sound too attractive.
I think that we're better off assuming the presence of a VFP, using
VFP FP instructions, and supporting FPU-less machines via emulation
(either in the OS kernel or in CCL itself.)  ARM ABIs have moved away
from this because emulation is apparently pretty slow; anyone who
does enough floating-point in CCL for this to matter should probably
get a machine with a real FPU.  (There are interesting FPU-less ARMs,
so it's not quite the same as the situation with SSE2 on X86, but I
don't see any reason to try to optimize for older variants.)

More: recent ARM variants (ARMv7, at least) support VFPv3 with NEON
SIMD extensions.  With VFPv3, the exception-enable bits in the FPSCR
always read as 0; that means that FP exceptions can't cause SIGFPE.

On a jailbroken iPod Touch, an enabled FP exception seems to force
a reboot instead of (or as well as) SIGFPE.

Checking to see if an FP operation generated an interesting FP
exception involves something like:

 (fp-operation fp1 fp2 fp3)
 (mov temp1 fpscr) ; the "exception occurred" bits get set and can be read
 (load temp2 (tcr.lisp-exception-bits))
 (ands temp3 temp1 temp2) ; set flags
 (uuo-if (:? ne) ....)

That just sucks; it makes FP exception checking a time/space/safety tradeoff.

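The check above boils down to masking the cumulative exception flags in
the FPSCR against the set of exceptions that lisp cares about.  A minimal
sketch of that logic in Python follows; the bit positions are the VFP
cumulative flag bits, but the function names are made up for illustration
and the layout of tcr.lisp-exception-bits (assumed here to use the same
bit positions as the cumulative flags) is an assumption, not something
these notes pin down.

```python
# Cumulative exception flag positions in the VFP FPSCR.
FPSCR_IOC = 1 << 0  # invalid operation
FPSCR_DZC = 1 << 1  # division by zero
FPSCR_OFC = 1 << 2  # overflow
FPSCR_UFC = 1 << 3  # underflow
FPSCR_IXC = 1 << 4  # inexact
FPSCR_IDC = 1 << 7  # input denormal


def interesting_exceptions(fpscr, lisp_exception_bits):
    """The ANDS step: keep only the flags that lisp asked to hear about.

    Assumes (hypothetically) that lisp_exception_bits uses the same bit
    positions as the cumulative flags.
    """
    return fpscr & lisp_exception_bits


def should_trap(fpscr, lisp_exception_bits):
    """Mirror of (uuo-if (:? ne) ...): trap when the masked result is nonzero."""
    return interesting_exceptions(fpscr, lisp_exception_bits) != 0
```

The cost being complained about is that every FP operation which wants
this check pays for the read, the load, the AND and the conditional UUO.
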
- instruction set

Some newer variants support Thumb2, which allows a mixture of 32-bit
and 16-bit instructions (better code density) and some other new features.
The original Thumb extensions offered 16-bit instructions, but basically
didn't allow 32-bit and 16-bit instructions to be mixed.

I don't think that we want to require Thumb2 (at least until it's more
dominant than it is now), and I don't see any use for the original
Thumb instructions.  That pretty much leaves us with the traditional ARM
instruction set, where most instructions are conditional and a fairly
traditional RISC architecture is exposed.  (IIRC, we can mix ARM and
Thumb2 code with no performance or other penalty, so we can phase Thumb2
in at some point in the future.)

- cache issues.

It's necessary to execute a system call in order to "make data executable",
i.e., to ensure that code is in the icache after it's been written to memory
via the dcache.  This suggests that we need to keep code vectors
disjoint from functions (as in the PPC ports and for similar reasons),
and that we can't easily recover FN from the PC (and need to make it be
a more explicit argument on each call.)

- tagging

Use the PPC32 tagging scheme, unless/until there's a reason not to.
One minor variation: I think that it's desirable to keep a code vector
and its (fixnum-tagged) entry point separate (either by storing them both
in the function object or by deriving the code vector from the entrypoint.)
(Either approach would likely require some care and some GC support.)
On the PPC, it was possible to move a tagged pointer to the CTR; branching
to the CTR was defined to ignore the low 2 bits (though Rosetta sometimes
forgot this.)  On the ARM, the results are undefined (are they ever ...)
when bit 1 is set in the PC, and setting bit 0 results in an ARM->Thumb
transition.  To load a code vector from a function and jump to it, we'd
have to do:

   ldr temp,[nfn,#function.codevector]
   bic pc,temp,#lisptagmask    ; or add/sub/whatever

instead of

   ldr pc,[nfn,#function.entrypoint]

The extra instruction isn't desirable, but likely wouldn't kill
us.  The need for yet another temp register during function call
is probably more of a problem, and possibly a significant problem.

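To make the low-bit problem concrete, here is a small Python sketch of
what the masking step in the two-instruction sequence has to accomplish,
and of what goes wrong without it.  The value of LISPTAGMASK (the low 2
bits, as in the PPC32 scheme) is an assumption, and both function names
are invented for this illustration.

```python
LISPTAGMASK = 0b11  # assumption: low 2 tag bits, as in the PPC32 scheme


def branch_target(tagged_codevector):
    """The address that ends up in the PC once the tag bits are cleared."""
    return tagged_codevector & ~LISPTAGMASK & 0xFFFFFFFF


def pc_write_hazard(value):
    """Classify what the low bits would do if written to the PC unmasked."""
    if value & 1:
        return "thumb"      # bit 0 set: ARM->Thumb transition
    if value & 2:
        return "undefined"  # bit 1 set: results are undefined
    return "ok"
```

A fixnum-tagged entry point has both low bits clear already, which is why
it could be loaded straight into the PC with a single ldr.
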
- register partitioning

The ARM offers 16 GPRs, but the PC and link register are two of them;
we probably need to use the C stack as a control stack (to make
it easier to recognize return addresses) and therefore use
a separate VSP (with no explicit frame pointer, a la PPC CCL);
we need to keep the TCR in a GPR, and probably need to keep
ALLOCPTR in a GPR as well (unless we do consing in some totally
different way), we need to keep FN in a GPR ... the short version
is that we're less register-starved than on x8632 and can probably
use a static partitioning scheme, but we're more register-starved
than even the x8664.  The current register partitioning isn't
carved in stone: we may find that it's better to have more/less
args/temps/imms, but I don't think that we can justify having
callee-save NVRs.

There are some cases in the runtime where we effectively want
to exchange the lr with the saved lr value in a stack frame.  It's
possible to do this with a "swp" instruction; "swp" as a means
of doing interlocked memory operations has been deprecated in
ARMv6 and later.  We don't care about those semantics, and if
there are only a few places (calling UNWIND-PROTECT cleanup
forms) that need to do this it's probably better to use "swp"
rather than inventing a scheme that allows some register to
be temporarily treated as a locative.


- Subprim calls

All other factors being equal, the best way to call a subprim
would be via a PC-relative call ("bl") instruction.  Those
factors aren't equal; we'd have to find and adjust all of those
instructions if we move the containing code vector.  "b" and "bl"
instructions use a 26-bit displacement (24 bits, in words), so
we'd have to do something to ensure that all code vectors were
within +/- 32MB of the subprims jump table or a copy of it, and
this seems like a lot of complicated overhead in order to use bl.

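The +/- 32MB limit falls straight out of the encoding: the offset field
is a signed 24-bit count of words, relative to the PC value the branch
sees (its own address plus 8).  A rough Python sketch of the reachability
check that anything moving code vectors around would need; the function
name is made up for illustration.

```python
def bl_offset_field(branch_addr, target_addr):
    """Return the signed 24-bit word offset for b/bl, or None if out of reach.

    The PC reads as the branch's address + 8, and the field counts
    4-byte words.
    """
    delta = target_addr - (branch_addr + 8)
    if delta % 4 != 0:
        return None  # ARM-state targets must be word aligned
    words = delta // 4
    if -(1 << 23) <= words < (1 << 23):
        return words
    return None
```

A code vector sitting at 32MB can still reach a subprim at #x4000 this
way; one sitting at 64MB can't.
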
The current plan seems to be almost as good (as good when code vectors
are purified), but it's based on a few brutal hacks.


ARM constant operands in MOV and ALU instructions are encoded as an
8-bit value and a 4-bit rotate count which allows the value to be
rotated right by an even number of bits; the number of unique 32-bit
values that can be expressed in this scheme seems to be a little over
3000.  (Some values can be expressed in more than one combination of
8-bit value/4-bit rotate count.)  Obviously, all integers from 0 to 255
can be encoded this way; integers >= 256 and < 1024 can be encoded if
they're multiples of 4.  There are 192 encodable values between #x4000 and
#x10000, and all of them are multiples of 256; we can plausibly
reserve addresses in that range.  (Linux currently doesn't set
the sysctl variable "vm.mmap_min_addr" on ARMs and sets it to 4K in
recent x86 distributions; Darwin wants a PAGEZERO region before mapped
memory, but we can control its size with linker options.)

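The counts quoted above are easy to check by brute force, since there are
only 16 * 256 possible encodings.  The Python sketch below enumerates
them; the names (ror32, encodings, ENCODABLE, SUBPRIM_RANGE) are just
labels chosen for this example.

```python
def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def encodings(value):
    """Return every (imm8, rot4) pair that encodes `value`; empty if none."""
    return [(imm8, rot4)
            for rot4 in range(16)
            for imm8 in range(256)
            if ror32(imm8, 2 * rot4) == value]


# Every 32-bit value that some 8-bit value/4-bit rotate pair can express.
ENCODABLE = {ror32(imm8, 2 * rot4) for rot4 in range(16) for imm8 in range(256)}

# The candidate subprim addresses: encodable values in [#x4000, #x10000).
SUBPRIM_RANGE = sorted(v for v in ENCODABLE if 0x4000 <= v < 0x10000)
```

Enumerated this way, len(ENCODABLE) comes out at 3073, and SUBPRIM_RANGE
holds the 192 multiples of 256 from #x4000 through #xff00.
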
Using 256 bytes for a jump table entry would be wasteful; losing the
jump table and using 256 bytes per subprim would be less so.  (Some
large subprims - those dealing with THROW and unwinding, for instance -
might not fit in 256 bytes and would have to be split into a part that
fits in the 256-byte fixed address range and a part that doesn't.)

Actually jumping to a subprim at address N (where N is an address expressible
as an ARM constant) is just:

   mov pc, #n

There are a few ways to do a call; those that aren't PC-relative (and I'm
leaning away from doing PC-relative calls in impure code) are generally
2 instructions long:

   mov lr, pc  ; when used as a source operand, the PC is read as .+8
   mov pc,#n

   or

   mov reg,#n
   blx reg      ; reg can be lr, e.g., blx lr goes to and returns to the
                ; right addresses

   or

   bl jn
   ...
jn:
   mov pc, #n


In the last of these schemes, we might have a jump table of "mov pc,#n"
instructions at the end of the code vector.  If we can purify code
vectors to somewhere within 32MB of the actual subprims, then purify()
can change calls into that jump table into PC-relative calls to the
actual subprim code.  That's likely a small savings, but it might add
up and it wouldn't be available under the first two approaches above.

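As an illustration of that rewrite (and only an illustration: this is not
how purify() is actually written, and the function names are invented),
the Python sketch below decodes a "mov pc,#n" jump-table entry and, if
the subprim is in range, builds the "bl" that would call it directly.
The two opcode constants are the standard ARM encodings of those
instructions with the "always" condition.

```python
MOV_PC_IMM = 0xE3A0F000  # "mov pc,#imm", cond = AL; low 12 bits are rot4:imm8
BL_ALWAYS = 0xEB000000   # "bl", cond = AL; low 24 bits are the word offset


def subprim_address(insn):
    """Decode the target of a "mov pc,#n" entry, or None if it isn't one."""
    if insn & 0xFFFFF000 != MOV_PC_IMM:
        return None
    imm8 = insn & 0xFF
    rot = 2 * ((insn >> 8) & 0xF)
    return ((imm8 >> rot) | (imm8 << (32 - rot))) & 0xFFFFFFFF


def direct_bl(call_addr, entry_insn):
    """Return a "bl" that calls the subprim directly, or None if it can't."""
    target = subprim_address(entry_insn)
    if target is None:
        return None
    delta = target - (call_addr + 8)  # PC reads as the call's address + 8
    if delta % 4 or not -(1 << 25) <= delta < (1 << 25):
        return None                   # misaligned or beyond +/- 32MB
    return BL_ALWAYS | ((delta >> 2) & 0xFFFFFF)
```

When direct_bl returns None the call simply stays as a "bl" into the
jump table, so the rewrite can be applied opportunistically.
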
I generally like the whole idea of using immediate addressing to reference
subprims, but it does depend on:

 a) the OS allowing an application to use low addresses in its address
    space
 b) the linker allowing us to build the application that way.  (Linkers
    generally provide a mechanism for this, but GNU ld scripts are
    sometimes sensitive to C library/toolchain versions, and it'd be
    good to avoid depending on them if possible.)

I suppose that another negative is:

 c) it's hard to create CCL shared libraries, because CCL wants more
    control over address space layout than a shared library usually
    has.  There are many other ways in which this is true, but it's
    a PITA to have to keep answering that question that way.

If these are the only negatives of this scheme, at the moment I'd say
that the positives outweigh them.

- implementation parameters

We can probably make NIL be a constant (#x10000005 or something similar);
the only real issues are whether we can count on mapping that address and
whether it'd wind up in the middle of an address range that we'd like to
use freely.

We probably have to limit CALL-ARGUMENTS-LIMIT and friends to 256, unless
we're willing to load larger values from memory or synthesize them via
a sequence of MOV/ORR in SET-NARGS, and we'd have to load or synthesize
a similar value or two into a temp reg in CHECK-NARGS.  (I'm sure that
both are doable; I'm not very certain that these things are worth doing.)

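For a sense of what "synthesize them via a sequence" would cost, here is
one simple (not necessarily minimal) way to split an arbitrary 32-bit
constant into pieces that each fit the rotated 8-bit immediate format:
the first piece would be loaded with MOV and the rest merged in with
further instructions.  This is a Python sketch with made-up function
names, not anything the compiler is known to do.

```python
def is_arm_immediate(value):
    """True if value is an 8-bit field rotated right by an even amount."""
    for rot in range(0, 32, 2):
        rotated_left = ((value << rot) | (value >> (32 - rot))) & 0xFFFFFFFF
        if rotated_left < 256:
            return True
    return False


def mov_orr_chunks(value):
    """Split a 32-bit constant into pieces that each fit an ARM immediate.

    Each piece is an 8-bit field starting at an even bit position, so at
    most four pieces are ever needed.
    """
    value &= 0xFFFFFFFF
    chunks = []
    while value:
        low = (value & -value).bit_length() - 1  # index of lowest set bit
        start = low & ~1                         # rotations are by even amounts
        piece = value & ((0xFF << start) & 0xFFFFFFFF)
        chunks.append(piece)
        value &= ~piece
    return chunks or [0]
```

Anything up to 255 is a single piece; 257 already needs two (1 and 256),
which is the extra instruction that the limit of 256 avoids.
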
Word-sized load and store instructions can use a constant displacement
of +/- 4K bytes from a base register.  We will probably rarely approach
these limits when referencing function constants and values in stack
frames, but it'd be good to have a way to handle these situations besides
saying "function/stack frame too large".

ARM documentation basically says that when a constant value can't be
represented as a rotated 8-bit value, it's best to load it from PC-relative
memory; ARM C functions generally have these "constant pools" interspersed
with executable code.  (Large functions might have them "interspersed";
smaller functions likely have code "followed by" constant pools.)  We almost
certainly have to deal with similar issues ... and they can be complicated.

  [explain complexity here ...]


- misc

I -hope- that the altstack mechanism works on all platforms and that
we don't have to explicitly check for stack overflow on r13.  (If it
does work, we probably want to check for r13 overflow in the altstack/Mach
fault handler and unprotect pages there.)


- consing & pc-lusering

The magic consing sequence (that pc_luser_xp() and maybe handle_alloc_trap()
will have to recognize) is basically:

1: decrement allocptr by an amount that'll leave it tagged as either
   tag_misc or tag_cons.

  a) CONSes:
   (sub allocptr allocptr (:$ (- arm::cons.size arm::fulltag-cons)))

  b) UVECTORs whose aligned size is known and <= 64K
   (sub allocptr allocptr (:$ (logand (- size arm::fulltag-misc) #xff)))
   (sub allocptr allocptr (:$ (logandc2 (- size arm::fulltag-misc) #xff)))

   Note that the second instruction above can be omitted if (<= size 256);
   if two instructions are used to decrement allocptr, they should be
   done in this order.

  c) UVECTORs whose size is > 64K or computed; this should be preceded
  by something which loads (- size arm::fulltag-misc) into some imm reg REG:

   (sub allocptr allocptr REG)

After this instruction (after the first in case (b)), allocptr is tagged
and the entire sequence needs to be completed or restarted by pc_luser_xp().

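The split in case (1b) can be sketched in Python as below.  The value of
FULLTAG_MISC (6, as in the PPC32 scheme) and the assumption that
"aligned" means a multiple of 8 bytes are both guesses made for the sake
of the example, and the function name is invented.

```python
FULLTAG_MISC = 6  # assumption: PPC32-style value; the ARM port may differ


def allocptr_decrements(size):
    """Immediates for the one or two SUBs in case (1b), in execution order."""
    if size % 8 or not 0 < size <= 0x10000:
        raise ValueError("case (1b) needs an aligned size <= 64K")
    delta = size - FULLTAG_MISC
    low = delta & 0xFF    # first SUB: also leaves allocptr tagged
    high = delta & ~0xFF  # second SUB: a multiple of 256, tag unchanged
    return [low] if high == 0 else [low, high]
```

Under those assumptions the low byte is always congruent to 2 mod 8, so
subtracting it from an 8-byte-aligned allocptr leaves the low 3 bits
equal to FULLTAG_MISC; the second subtraction is a multiple of 256 and
can't disturb the tag.  That's why allocptr counts as tagged after the
first instruction and why the order matters.
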
2: Load (:@ rcontext (:$ arm::tcr.allocbase)) into some available GPR.
We can't afford to dedicate a register to contain allocbase, and
tcr.allocbase can ordinarily change on any instruction boundary so we
can only do this after decrementing and tagging allocptr in step 1 above.

   (ldr reg2 (:@ rcontext (:$ arm::tcr.allocbase)))

"reg2" can't be the same register used in 1c and shouldn't conflict with
anything used to contain headers/CAR/CDR values, but can otherwise be
any GPR.

3: Compare allocptr to the register holding allocbase; do a uuo-alloc-trap
if allocptr is U< allocbase.  (We could encode the "lo" condition in
uuo-alloc-trap if we wanted to.)

   (cmp allocptr reg2)
   (uuo-alloc-trap (:? lo))

If the trap is taken, the handler should be able to determine the size
and tag of the allocation and resume execution at the next instruction
with allocptr pointing at zeroed memory.  The register used to hold
allocbase won't change, but its value may or may not have anything to
do with tcr.allocbase at this point.

4: Initialize the object; if a UVECTOR, set its header.  If a CONS, set
its CAR and CDR.

    (str header (:@ allocptr (:$ arm::misc-header-offset)))

    or

    (str Rcar (:@ allocptr (:$ arm::cons.car)))
    (str Rcdr (:@ allocptr (:$ arm::cons.cdr)))

5: Copy allocptr to the destination register.  (In the CONS case, this may
have been one of Rcar/Rcdr.)

    (mov dest allocptr)

6: Clear the low 3 bits of allocptr.

     (bic allocptr allocptr (:$ arm::fulltagmask))

This is basically the same sequence as is used on the PPC; the differences
are:

  a) We might decrement allocptr twice in (1b)
  b) We don't use a dedicated register to contain allocbase, and have
     to load it from the TCR into a temp register after allocptr's been
     decremented.
  c) We don't exactly have conditional traps, so we have to do a compare
     followed by a conditional UUO.

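Steps 1-6 for the CONS case can be modelled in a few lines of Python.
This is a toy: the TCR is a dict, memory is a dict keyed by tagged
address, the trap is an exception that just gives up rather than
refilling the heap and resuming, and the tag and size constants are
PPC32-style assumptions.  None of the names below come from the kernel
sources.

```python
FULLTAGMASK = 7   # low 3 bits, per step 6
FULLTAG_CONS = 1  # assumption: PPC32-style value
CONS_SIZE = 8     # assumption: two 32-bit words


class AllocTrap(Exception):
    """Stands in for uuo-alloc-trap; the real handler would resume instead."""


def cons(tcr, memory, car, cdr):
    """Run steps 1-6 for a CONS against a toy TCR; return the tagged pointer."""
    tcr["allocptr"] -= CONS_SIZE - FULLTAG_CONS  # 1: decrement and tag
    reg2 = tcr["allocbase"]                      # 2: load allocbase
    if tcr["allocptr"] < reg2:                   # 3: compare, trap if lower
        raise AllocTrap(tcr["allocptr"])
    memory[tcr["allocptr"]] = (car, cdr)         # 4: store CAR and CDR
    dest = tcr["allocptr"]                       # 5: copy to destination
    tcr["allocptr"] &= ~FULLTAGMASK              # 6: untag allocptr
    return dest


def mid_allocation(allocptr):
    """True while allocptr is tagged, i.e. somewhere between steps 1 and 6."""
    return (allocptr & FULLTAGMASK) != 0
```

The mid_allocation test is the property that something like pc_luser_xp()
can lean on: a tagged allocptr in an interrupted context means the
sequence was entered and has to be finished or backed out.
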