Opened 4 years ago

Closed 3 years ago

#1410 closed defect (fixed)

Segmentation fault at safety 3

Reported by: matt.kaufmann
Owned by: rme
Priority: major
Milestone:
Component: Compiler
Version: trunk
Keywords:
Cc:

Description

It seems very surprising to get a segmentation fault with safety 3. Below are instructions from the README file of the attached gzipped tarfile, which show how to reproduce that behavior. This example is a lightly modified version of a real problem we are having, which we cannot debug because of the segmentation fault. Note: We did this on Linux:

dunnottar:~% uname -a
Linux dunnottar 3.13.0-110-generic #157-Ubuntu SMP Mon Feb 20 11:54:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
dunnottar:~%

But we have seen this problem on Mac as well.

Note: We have reproduced this bug using this CCL build:

Welcome to Clozure Common Lisp Version 1.12-dev-r16783M-trunk (LinuxX8664)!

but also with builds as recent as last weekend.

.....

wget https://github.com/acl2-devel/acl2-devel/releases/download/7.3/acl2-7.3.tar.gz
tar xfz acl2-7.3.tar.gz
# Save disk space
rm -rf acl2-7.3/books
rm acl2-7.3.tar.gz
cd acl2-7.3
(time nice make LISP=ccl ACL2_SAFETY=3) >& make-safety-3.log
cd ..
./acl2-7.3/saved_acl2
(value :q)
(load "clrat-parser.lisp")
(clrat-read-file "R_4_4_18.clrat" state)

Attachments (1)

ccl-bug.tgz (4.7 MB) - added by matt.kaufmann 4 years ago.
See ticket description (or README from this gzipped tarfile)

Change History (16)

comment:1 Changed 4 years ago by rme

  • Owner set to rme

Clearly, stuff shouldn't crash when built with safety 3. Full stop.

As a matter of curiosity, though, what are you looking for in terms of safety that the default safety 1 doesn't provide?

comment:2 Changed 4 years ago by matt.kaufmann

Regarding your question about what I'm "looking for in terms of safety":

Actually, we normally run at safety 3. As best I recall (it's been a few months since I reported this), I was trying to debug some problem and I kept increasing the safety, hoping that I could get a backtrace.

comment:3 Changed 4 years ago by rme

I had no idea that you ran at safety 3 normally. I'm glad to know that.

As I said, that should always work (absent errors calling foreign code, etc.). But it's true that we have never regularly run our tests (or a CCL build, for that matter) compiled with a non-default safety setting (the default being safety 1). That's a shortcoming I'll need to remedy.

comment:4 Changed 4 years ago by matt.kaufmann

Oh my gosh -- so sorry, I misspoke!! I meant to say that normally we run at safety 0. After the segfault I kept increasing the safety all the way to 3, and kept getting the segfault.

FYI, normally we declaim as follows, to maximize speed.

(OPTIMIZE (COMPILATION-SPEED 0) (DEBUG 0) (SPEED 3) (SPACE 0) (SAFETY 0))

comment:5 Changed 4 years ago by matt.kaufmann

Ah, one more thing: every few months, before an ACL2 release, I do an ACL2 regression with safety 3, just as an extra check.

comment:6 Changed 4 years ago by rme

I can duplicate the bug. As you allude, the signal just kills the whole process with extreme prejudice.

This sometimes means that the stack pointer has ended up pointing to unmapped memory: when the OS tries to push a signal context in order to run a signal handler, there is nothing more it can do than make the process go "poof" with no further comment.

Using CCL 1.11 works.

comment:7 Changed 3 years ago by rme

I think this is saying that create_thread_context_frame is trying to find an mcontext_t struct on a stack it finds in the thread's state. Could we be failing to find the correct stack pointer?

Process 25901 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x7000061d6b10)
    frame #0: 0x00007fff5f1ba02c libsystem_platform.dylib`_platform_memmove$VARIANT$Haswell + 268
libsystem_platform.dylib`_platform_memmove$VARIANT$Haswell:
->  0x7fff5f1ba02c <+268>: vmovups %ymm0, (%rax)
    0x7fff5f1ba030 <+272>: vxorps %ymm0, %ymm0, %ymm0
    0x7fff5f1ba034 <+276>: vmovups 0x20(%rsi), %ymm2
    0x7fff5f1ba039 <+281>: addq   $0x40, %rsi
Target 0: (dx86cl64) stopped.
(lldb) gui
(lldb) bt
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x7000061d6b10)
  * frame #0: 0x00007fff5f1ba02c libsystem_platform.dylib`_platform_memmove$VARIANT$Haswell + 268
    frame #1: 0x00007fff5f006de2 libsystem_c.dylib`__memmove_chk + 22
    frame #2: 0x00000000000327dd dx86cl64`create_thread_context_frame(thread=62979, new_stack_top=0x00007000061d3a20, info_ptr=0x00007000061d3a10, tcr=0x0000000000103090, ts=0x00000000000b9040) at x86-exceptions.c:3167
    frame #3: 0x00000000000328b4 dx86cl64`setup_signal_frame(thread=62979, handler_address=0x0000000000031400, signum=10, code=2, tcr=0x0000000000103090, ts=0x00000000000b9040, new_ts=0x00000000000b802c) at x86-exceptions.c:3224
    frame #4: 0x0000000000032f42 dx86cl64`catch_mach_exception_raise_state(exception_port=63235, exception=1, code=0x00000000000b9028, code_count=2, flavor=0x00000000000b9038, in_state=0x00000000000b9040, in_state_count=42, out_state=0x00000000000b802c, out_state_count=0x00000000000b8028) at x86-exceptions.c:3448
    frame #5: 0x000000000003e8b8 dx86cl64`_Xmach_exception_raise_state(InHeadP=0x00000000000b9000, OutHeadP=0x00000000000b8000) at mach_exc_server.c:318
    frame #6: 0x000000000003eb8c dx86cl64`mach_exc_server(InHeadP=0x00000000000b9000, OutHeadP=0x00000000000b8000) at mach_exc_server.c:518
    frame #7: 0x00007fff5f07cc00 libsystem_kernel.dylib`mach_msg_server + 417
    frame #8: 0x000000000003308e dx86cl64`exception_handler_proc(arg=0x0000000000001103) at x86-exceptions.c:3505
    frame #9: 0x00007fff5f1c06c1 libsystem_pthread.dylib`_pthread_body + 340
    frame #10: 0x00007fff5f1c056d libsystem_pthread.dylib`_pthread_start + 377
    frame #11: 0x00007fff5f1bfc5d libsystem_pthread.dylib`thread_start + 13
(lldb) up
frame #1: 0x00007fff5f006de2 libsystem_c.dylib`__memmove_chk + 22
libsystem_c.dylib`__memmove_chk:
    0x7fff5f006de2 <+22>: movq   %rbx, %rax
    0x7fff5f006de5 <+25>: addq   $0x8, %rsp
    0x7fff5f006de9 <+29>: popq   %rbx
    0x7fff5f006dea <+30>: popq   %rbp
(lldb) up
frame #2: 0x00000000000327dd dx86cl64`create_thread_context_frame(thread=62979, new_stack_top=0x00007000061d3a20, info_ptr=0x00007000061d3a10, tcr=0x0000000000103090, ts=0x00000000000b9040) at x86-exceptions.c:3167
   3164	  stackp = TRUNC_DOWN(stackp, sizeof(*mc), C_STK_ALIGN);
   3165	  mc = (MCONTEXT_T) ptr_from_lispobj(stackp);
   3166	  
-> 3167	  memmove(&(mc->__ss),ts,sizeof(*ts));
   3168	
   3169	  thread_state_count = NATIVE_FLOAT_STATE_COUNT;
   3170	  thread_get_state(thread,
(lldb) p mc
(MCONTEXT_T) $0 = 0x00007000061d6b00
(lldb) p *mc
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory
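
A quick arithmetic check on the addresses above, written as a small Python sketch (Python is used here only for illustration; it is not part of CCL). The constants are copied from the lldb output. The `trunc_down` helper is an assumed reading of what the `TRUNC_DOWN` macro does (reserve space below an address, then round down to an alignment), not its actual definition, and the 16-byte alignment used with it is likewise an assumption.

```python
# Addresses copied from the lldb session above.
MC_ADDR = 0x00007000061D6B00     # result of `p mc`
FAULT_ADDR = 0x7000061D6B10      # address reported with EXC_BAD_ACCESS


def trunc_down(addr, size, align):
    """Assumed behaviour of TRUNC_DOWN (hypothetical reconstruction):
    reserve `size` bytes below `addr`, then round down to a multiple of `align`."""
    return ((addr - size) // align) * align


# Distance from the start of the structure to the faulting address.
fault_offset = FAULT_ADDR - MC_ADDR

print(f"fault offset from mc: {fault_offset} bytes")
print(f"mc is 16-byte aligned: {MC_ADDR % 16 == 0}")
print(f"trunc_down(0x1008, 0x20, 16) = {trunc_down(0x1008, 0x20, 16):#x}")
```

The faulting address lies 16 bytes past `mc`, which is consistent with the `memmove` destination `&(mc->__ss)` being a field a short way into the structure that `mc` points to. In other words, the copy appears to fail on its first bytes, so the whole destination is probably unmapped rather than only its tail.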

The thread state:

(lldb) p *ts
(native_thread_state_t) $6 = {
  __rax = 77870
  __rbx = 123145404903392
  __rcx = 40
  __rdx = 7680
  __rdi = 61608
  __rsi = 104
  __rbp = 633444688
  __rsp = 633444648
  __r8 = 88125264
  __r9 = 93190656
  __r10 = 52914056558708
  __r11 = 1061008
  __r12 = 52777632230366
  __r13 = 52914056443295
  __r14 = 633444448
  __r15 = 0
  __rip = 52914056443333
  __rflags = 66050
  __cs = 43
  __fs = 0
  __gs = 0
}
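
lldb prints the saved registers in decimal here but uses hexadecimal for addresses elsewhere in this comment, so converting a few of them makes the values easier to compare. The following Python sketch does only that; the register values are copied from the output above, and `as_hex` is a helper name introduced for this illustration.

```python
# Register values copied from the `p *ts` output above (decimal).
regs = {
    "rbx": 123145404903392,
    "rbp": 633444688,
    "rsp": 633444648,
    "r11": 1061008,
    "rip": 52914056443333,
}


def as_hex(value):
    """Format an integer the way lldb prints addresses (lowercase hex)."""
    return f"0x{value:x}"


for name, value in regs.items():
    print(f"{name:>3} = {as_hex(value)}")
```

The converted `rip` is `0x30200389b5c5`, and `rbx` comes out as `0x7000061d6fe0`, in the same `0x7000061d...` range as the `mc` pointer seen earlier.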

Disassembling at the instruction pointer shows

(lldb) x/10i 52914056443333
    0x30200389b5c5: 0f 7f 3b                       movq   %mm7, (%rbx)
    0x30200389b5c8: 41 0f 6f bb 78 01 00 00        movq   0x178(%r11), %mm7
    0x30200389b5d0: 0f 7f 7b 08                    movq   %mm7, 0x8(%rbx)
    0x30200389b5d4: 49 89 9b 78 01 00 00           movq   %rbx, 0x178(%r11)
    0x30200389b5db: 48 8b 7d e0                    movq   -0x20(%rbp), %rdi
    0x30200389b5df: 48 be 00 00 00 00 00 00 80 00  movabsq $0x80000000000000, %rsi   ; imm = 0x80000000000000 
    0x30200389b5e9: 48 39 f7                       cmpq   %rsi, %rdi
    0x30200389b5ec: 7c 21                          jl     0x30200389b60f
    0x30200389b5ee: 31 f6                          xorl   %esi, %esi
    0x30200389b5f0: 49 8b 83 78 01 00 00           movq   0x178(%r11), %rax
(lldb) 
    0x30200389b5f7: 0f 6f 78 08              movq   0x8(%rax), %mm7
    0x30200389b5fb: 48 8b 00                 movq   (%rax), %rax
    0x30200389b5fe: 49 89 43 50              movq   %rax, 0x50(%r11)
    0x30200389b602: 41 0f 7f bb 78 01 00 00  movq   %mm7, 0x178(%r11)
    0x30200389b60a: 48 89 ec                 movq   %rbp, %rsp
    0x30200389b60d: 5d                       popq   %rbp
    0x30200389b60e: c3                       retq   

This code looks like it comes from a save-nfp vinsn / something / restore-nfp sequence.

comment:8 Changed 3 years ago by rme

My guess at the moment is that the compiler might not be tracking the nfp correctly. The nfp is used a lot more on the 1.12 branch than it had been in 1.11, and its workings are not 100% clear to me.

comment:9 Changed 3 years ago by rme

The example works normally as of trunk binaries "Version 1.12-dev-r16576M-trunk (DarwinX8664)" fetched from https://trac.clozure.com/ccl/changeset/16579

comment:10 Changed 3 years ago by rme

Next step: try the binaries referenced in https://trac.clozure.com/ccl/changeset/16612

comment:11 Changed 3 years ago by rme

Worked with r16612 binaries.

comment:12 Changed 3 years ago by rme

Fails at r16752

comment:13 Changed 3 years ago by rme

Further bisecting identifies r16704 as the guilty party.
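
The search in comments 9 through 13 is a bisection over revision numbers between a known-good build (r16612) and a known-bad one (r16752). A minimal Python sketch of that procedure follows. The `simulated_check` predicate is a stand-in invented for this example: the real test was building or fetching binaries for a revision and running the reproduction steps from the ticket description. The sketch also assumes that every revision in the range can be tested and that the failure, once introduced, persists in all later revisions.

```python
def first_bad_revision(good, bad, is_bad):
    """Return the earliest revision in (good, bad] for which is_bad is true.

    Assumes is_bad(good) is False, is_bad(bad) is True, and that every
    revision after the first bad one is also bad.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid    # the regression is at mid or earlier
        else:
            good = mid   # the regression is after mid
    return bad


def simulated_check(rev):
    """Hypothetical stand-in for running the reproduction against a build."""
    return rev >= 16704


print(first_bad_revision(16612, 16752, simulated_check))
```

With 140 revisions between the two endpoints, this needs at most eight test runs, which is why bisecting is practical even when each run involves a full ACL2 build.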

comment:15 Changed 3 years ago by rme

  • Resolution set to fixed
  • Status changed from new to closed