Opened 4 years ago

Last modified 4 years ago

#1271 reopened defect

YMM registers corrupted by foreign function interrupts

Reported by: jared Owned by: gb
Priority: normal Milestone:
Component: Foreign Function Interface Version: trunk
Keywords: interrupts, ymm, avx2 Cc: varun@…

Description

Hi,

I think CCL/Linux/X86-64 may have a bug that causes the upper bits of the YMM registers to be corrupted when an interrupt occurs during a foreign function call. This can affect the behavior of foreign code that uses these registers.

A tarball is attached with a distilled example to try to allow you to reproduce the bug. Note that to run this test and see the bug your CPU will need AVX2 support, e.g., you will probably need to run this on a Haswell processor.

Instructions:

  • Extract the tarball somewhere.
  • (Optional) Run "make" to build libtemp.so.0.1.

You can try just using the version I've included in the tarball, but I don't know how likely it is to be compatible with your system.

If you need to build it yourself, you will need a copy of NASM (http://www.nasm.us) to assemble the tempasm.o file.

  • Run CCL and submit the following to verify that the test code works:
          (load "temp.lisp")
          (doit)
    

This should print "Hello from C!" and after several seconds should print "Answer is correct."

  • Now try to reproduce the bug. To do this:
      • Submit (doit) again.
      • Wait for it to say "Calling assembly code routine:"
      • Use Ctrl+C to send an interrupt.
      • Wait for the Break prompt.
      • Type :go
      • It should then print "Incorrect answer" and some details.

I am well out of my element here, but it looks to me like the problem _might_ stem from the function

x86-exceptions.c:handle_signal_on_foreign_stack

which does this:

    #ifdef LINUX
      foreign_rsp = copy_fpregs(context, foreign_rsp, &fpregs);
    #endif
    #ifdef FREEBSD
      foreign_rsp = copy_avx(context, foreign_rsp, &fpregs);
    #endif
    ...

Perhaps the Linux case needs to do something more like the FreeBSD case, to copy the AVX registers as well?

Some information about my system:

$ ccl --version
Version 1.11-dev-r16355M-trunk  (LinuxX8664)

$ uname -a
Linux compute-1-4.local 2.6.32-431.el6.x86_64 #1 SMP Thu Nov 21 13:35:52 CST 2013 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 60
model name	: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
stepping	: 3
cpu MHz		: 800.000
cache size	: 8192 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 avx2 smep bmi2 erms invpcid
bogomips	: 6997.67
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:
[...]

Please let me know if I can provide any other details that would be useful.

Cheers, Jared

Attachments (1)

lisptest.tar.gz (7.4 KB) - added by jared 4 years ago.
Tarball with test code


Change History (8)

Changed 4 years ago by jared

Tarball with test code

comment:1 Changed 4 years ago by gb

  • Owner set to gb
  • Status changed from new to assigned

comment:2 Changed 4 years ago by gb

For GC-related reasons, we try to copy signal contexts between stacks on x86.

I think that some people (quite possibly including rme) have expressed concern that this may not be viable in the long term. I'm starting to agree with that point of view. At the very least, it's very complicated code that has to work very well if it's there at all.

We generally want it to be possible to parse the lisp cstack. On other platforms, we view the stack as containing alternating regions of parsable lisp data and raw foreign data, and schemes to support this don't have to be complicated (and would likely be simpler than handle_signal_on_foreign_stack() and friends).

I think that I understand where Linux hides AVX state in a signal context, and I think that I know how to copy that state and how to tell how large it is. So far, doing a better job of copying the signal context hasn't fixed the problem. I want to understand why it doesn't, but it would certainly be nice not to have to do this. A signal handler in CCL pretty much has to run as foreign code, and our stack-switching is part of that, but copying signal contexts around seems to be asking to lose.
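For readers following along, here is a minimal sketch of how the size of the floating-point state in a Linux x86-64 signal frame can be determined. It is not the code in CCL's copy_fpregs(). The helper name fpstate_copy_size and its avail parameter are hypothetical, and the offsets and magic values are taken from the layout described in the Linux uapi header asm/sigcontext.h (the software-reserved bytes at the end of the 512-byte FXSAVE image carry a magic number and the total size, and a second magic number ends the extended area). The function works on a plain byte buffer so it can be tried without a live signal context.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values as described in the Linux uapi header asm/sigcontext.h. */
#define FXSAVE_AREA_SIZE   512u         /* legacy FXSAVE image (x87 + XMM) */
#define SW_RESERVED_OFFSET 464u         /* last 48 bytes of the FXSAVE image */
#define FP_XSTATE_MAGIC1   0x46505853u  /* marks an extended (XSAVE) frame */
#define FP_XSTATE_MAGIC2   0x46505845u  /* trailer at the end of that frame */

/* Hypothetical helper: how many bytes of FP state begin at `fpstate`?
   `avail` is the number of readable bytes the caller can vouch for.
   Returns 0 if there is not even room for a legacy frame. */
size_t fpstate_copy_size(const unsigned char *fpstate, size_t avail)
{
    uint32_t magic1, extended_size, magic2;

    if (avail < FXSAVE_AREA_SIZE)
        return 0;

    /* Software-reserved bytes: magic1 first, then the total frame size. */
    memcpy(&magic1, fpstate + SW_RESERVED_OFFSET, sizeof magic1);
    memcpy(&extended_size, fpstate + SW_RESERVED_OFFSET + 4,
           sizeof extended_size);

    if (magic1 != FP_XSTATE_MAGIC1)
        return FXSAVE_AREA_SIZE;        /* legacy frame only, no YMM state */

    if (extended_size < FXSAVE_AREA_SIZE + sizeof magic2 ||
        extended_size > avail)
        return FXSAVE_AREA_SIZE;        /* implausible size: fall back */

    /* The extended area is only trusted if the trailer is present. */
    memcpy(&magic2, fpstate + extended_size - sizeof magic2, sizeof magic2);
    if (magic2 != FP_XSTATE_MAGIC2)
        return FXSAVE_AREA_SIZE;

    return extended_size;               /* FXSAVE image plus XSAVE data */
}
```

Copying only the first 512 bytes preserves the XMM registers (the low 128 bits of each YMM register) but not the upper halves, which would be consistent with the corruption described in this ticket.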

comment:3 Changed 4 years ago by gb

  • Resolution set to fixed
  • Status changed from assigned to closed

(In [16364]) for 64-bit Linux: copy_fpregs() copies the secret AVX state correctly. fixes ticket:1271 in the trunk. TODO: look at x8632, Darwin, Solaris, and Windows.

comment:4 Changed 4 years ago by jared

  • Resolution fixed deleted
  • Status changed from closed to reopened

Hi,

I've tried this out and it seems to work great on a Haswell machine. Unfortunately it appears to cause CCL to crash on various other machines. This crash may be much more severe/widespread than just the FFI: I seem to be running into segfaults while trying to certify many ACL2 books, for instance.

I can reliably trigger a crash by just running:

(load "temp.lisp")
(doit)

on a suitable machine. Here's a log that shows what happens:

$ hostname
compute-1-2.local
$ cat /proc/cpuinfo | grep model | head -2
model		: 45
model name	: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
$ ccl 
Welcome to Clozure Common Lisp Version 1.11-dev-r16365M-trunk  (LinuxX8664)!

CCL is developed and maintained by Clozure Associates. For more information
about CCL visit http://ccl.clozure.com.  To enquire about Clozure's Common Lisp
consulting services e-mail info@clozure.com or visit http://www.clozure.com.

? (load "temp.lisp")
(load "temp.lisp")
#P"/share/apps/fv/jared/xb/cn/proofs/c86/xval/ccl-ymm-bugreport/temp.lisp"
? (doit)
(doit)
Hello from C!
 - head: #ux8000_0000
 - body: #uxffff_eeee_dddd_cccc_bbbb_aaaa_9999_8888_7777_6666_5555_4444_3333_2222_1111_dead
Calling assembly code routine:
Unhandled exception 4 at 0x2b44f804e9c5, context->regs at #x2b44d95b22f8
Exception occurred while executing foreign code
? for help
[23289] Clozure CL kernel debugger: 

It looks like I get crashes on many computers -- maybe anything without AVX2?

  • Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz --- works fine, no crash
  • Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz --- crashes
  • Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz --- crashes
  • Intel(R) Xeon(R) CPU X7350 @ 2.93GHz -- crashes

Cheers, Jared

comment:5 Changed 4 years ago by jared

  • Resolution set to fixed
  • Status changed from reopened to closed

Arrgh, sorry, please disregard all of this --- I'm a complete idiot and forgot that this test runs an AVX2 instruction and hence will certainly crash on machines without AVX2. I'll look into whether I can debug the other segfaults I'm getting.
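A small guard can avoid this kind of false alarm: check the CPU flags before running the AVX2 test, since executing an AVX2 instruction on a CPU without AVX2 raises an illegal-instruction fault (the "exception 4" in the log above matches SIGILL, which is signal 4 on Linux). The sketch below is only an illustration. The helper name cpu_flags_contain is hypothetical, and it expects the caller to pass in the text of a "flags" line from /proc/cpuinfo. It matches whole tokens, so "avx" is not mistaken for "avx2".

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if `flag` appears as a whole space- or
   tab-separated token in `flags` (the text of a /proc/cpuinfo "flags"
   line). Substring hits such as "avx" inside "avx2" are rejected. */
bool cpu_flags_contain(const char *flags, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = flags;

    if (n == 0)
        return false;

    while ((p = strstr(p, flag)) != NULL) {
        bool starts_token = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        bool ends_token = p[n] == '\0' || p[n] == ' ' ||
                          p[n] == '\t' || p[n] == '\n';
        if (starts_token && ends_token)
            return true;
        p += 1;   /* partial match: keep scanning after this position */
    }
    return false;
}
```

With the flags line from the original report, cpu_flags_contain(line, "avx2") is true. On a CPU whose flags line lists avx but not avx2, it is false and the test should be skipped.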

comment:6 Changed 4 years ago by gb

The code that actually changed runs every time that a thread runs a signal handler. That happens very often.

A few months ago (late January/early February) someone had decided that ACL2(h) should run with the EGC on. That didn't work, and I was looking into how that could be made to work when I had a fire. I don't know if or how that was resolved.

comment:7 Changed 4 years ago by gb

  • Resolution fixed deleted
  • Status changed from closed to reopened

The change that was intended to fix this introduced severe problems for people running on Linux 2.6.32 kernels.

I don't know why and have not been able to guess.

At some point, it will likely be necessary to drop support for very old OS versions.

As far as I know, the fix worked and was otherwise reliable, so it should be straightforward to reinstate it.
