Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#1029 closed defect (invalid)

segfault during GC

Reported by: fare
Owned by:
Priority: normal
Milestone:
Component: Runtime (threads, GC)
Version: 1.8
Keywords: ITA
Cc:


While running our code in production, we have been catching this "interesting" segfault, seemingly in the middle of garbage collection.

We strongly suspect our code was trying to print an XML message at that time using libxml2, which may or may not suggest some interaction between GC and FFI.

Xref: ITA bug 122988

Attachments (2)

122988.2.log (11.4 KB) - added by fare 7 years ago.
CCL's final words
122988.log (11.4 KB) - added by fare 7 years ago.
CCL's final words


Change History (6)

Changed 7 years ago by fare

CCL's final words

comment:1 Changed 7 years ago by fare

(I swear I only clicked once; was it my computer duplicating the click, or your server?)

comment:2 Changed 7 years ago by gb

About all that can be said from looking at the backtrace is that this is a symptom of some sort of memory corruption (something that should be a valid lisp object looks to the GC like a vector with an implausibly large number of elements), and that the corruption is due to one of:

  • a bug in CCL
  • a bug in your code
  • a bug in third party code.

It's not really possible to narrow that down further from the information provided.

You could cause this kind of heap corruption by doing something like:

(let* ((x (make-array 1 :element-type '(signed-byte 64)))
       (y (make-array 1 :element-type '(signed-byte 64))))
    (declare (optimize (speed 3) (safety 0)))
    ;; Note that this code is setting element 1 of a 1-element vector; under
    ;; less aggressive optimization settings, an error would be signaled here.
    ;; Under these optimization settings, X is likely clobbered.
    (setf (aref y 1) (dpb target::subtag-s64-vector
                          (byte 8 0)

That's likely to be very reproducible; whatever's actually causing the problem is possibly much more subtle than the example above. "Incorrect and unsafe code" is likely responsible, but I have no way of knowing what code is actually responsible.

Many CL libraries contain declamations like:

(declaim (optimize (speed 3) (safety 0)))

when they mean to say

(eval-when (:compile-toplevel)
  (declaim (optimize (speed 3) (safety 0))))

and the fact that the declamation has (unintended) load-time effects means that other code can be compiled with those unsafe settings in effect. I don't know if that's a factor in your case, but it certainly could be.
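As a rough illustration of keeping unsafe settings contained, the sketch below uses a local DECLARE instead of a global DECLAIM, so the policy applies to one function body only. The function name SUM-FIXNUMS is hypothetical, and the policy values in the final PROCLAIM are just an example of a conservative setting, not CCL's documented defaults.

```lisp
;; A local DECLARE limits the unsafe policy to the body of one function;
;; nothing leaks into code that is compiled or loaded afterwards.
;; SUM-FIXNUMS is a hypothetical example function.
(defun sum-fixnums (v)
  (declare (type (simple-array fixnum (*)) v)
           (optimize (speed 3) (safety 0)))
  (let ((sum 0))
    (declare (type fixnum sum))
    ;; With SAFETY 0 the caller is responsible for passing the declared
    ;; array type and for keeping the running total within fixnum range.
    (dotimes (i (length v) sum)
      (setq sum (the fixnum (+ sum (aref v i)))))))

;; If a library may already have proclaimed an unsafe policy at load time,
;; a conservative global policy can be proclaimed again afterwards.
;; The values here are illustrative.
(proclaim '(optimize (speed 1) (safety 1) (debug 1)))
```

Calling `(sum-fixnums (make-array 3 :element-type 'fixnum :initial-contents '(1 2 3)))` should return 6, while code compiled later in the same session is governed by the re-proclaimed policy rather than by SAFETY 0.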

There are other ways (besides incorrect/unsafe Lisp code) that can cause lisp memory to be corrupted, but that's probably the most common way and is the first thing to look at.

What code is executing when the GC runs and chokes on a corrupted object isn't necessarily relevant; anything that's run since the last time that the GC ran (and didn't choke) could be.

It is possible that the corruption that the GC is choking on is a secondary (or tertiary, or ...) effect of something less drastic that happened several GCs earlier (and that things get slightly worse each GC until they're bad enough to cause a segfault.)

The GC ordinarily assumes that the heap isn't corrupt. You can set a bit in the value of a variable that's visible to the GC:

(setq ccl::*gc-event-status-bits* 4)

to force it to do fairly rigorous integrity checks before and after it runs. (Those checks are typically much more expensive than the GC itself is, but they have a better chance of catching problems before they snowball into something that'd cause the GC to segfault.)
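One way to use that bit to narrow down the culprit is to run a suspect piece of code with the checks enabled and force a GC right after it, so corruption is looked for immediately rather than at some later, unrelated point. This is only a sketch for CCL: the helper name CALL-WITH-GC-INTEGRITY-CHECKS is hypothetical, and it assumes the variable holds an integer bit mask (as the SETQ above suggests) and that CCL:GC performs a full collection before returning.

```lisp
;; CCL-only sketch. CALL-WITH-GC-INTEGRITY-CHECKS is a hypothetical helper.
(defun call-with-gc-integrity-checks (thunk)
  "Call THUNK with the integrity-check bit (4) set, then force a GC."
  (let ((saved ccl::*gc-event-status-bits*))
    ;; SETQ rather than LET: the GC looks at the global value, so a
    ;; thread-local binding would not be visible to it.
    (setq ccl::*gc-event-status-bits* (logior saved 4))
    (unwind-protect
         (multiple-value-prog1 (funcall thunk)
           ;; Force a collection now, so that anything THUNK stepped on
           ;; is checked for while THUNK is still the obvious suspect.
           (ccl:gc))
      ;; Clear the bit again only if it was not set on entry.
      (unless (logtest saved 4)
        (setq ccl::*gc-event-status-bits*
              (logandc2 ccl::*gc-event-status-bits* 4))))))
```

Wrapping successive stages of a request in calls like `(call-with-gc-integrity-checks (lambda () (suspect-step)))` (where SUSPECT-STEP stands in for your own code) may help bisect which stage leaves the heap in a bad state, at a considerable cost in run time.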

Exactly how the CCL compiler is affected by OPTIMIZE settings is described (fairly well) at . Compiling your application under a POLICY that disallows unsafe optimizations (regardless of optimization settings) may be easier than wrestling with ill-considered declamations.
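A minimal sketch of such a policy follows. It assumes that CCL exports NEW-COMPILER-POLICY, SET-CURRENT-COMPILER-POLICY and SET-CURRENT-FILE-COMPILER-POLICY, and that the :TRUST-DECLARATIONS and :INHIBIT-SAFETY-CHECKING keywords each accept a function of the compilation environment; check these names and keywords against the CCL documentation for your version before relying on them. NEVER-ALLOW and *SAFE-POLICY* are hypothetical names.

```lisp
;; CCL-only sketch; the policy keywords used here are assumptions to be
;; verified against the CCL manual.
(defun never-allow (env)
  "Policy function that refuses the optimization in every environment."
  (declare (ignore env))
  nil)

(defparameter *safe-policy*
  (ccl:new-compiler-policy
   ;; Do not trust type declarations, whatever OPTIMIZE says.
   :trust-declarations #'never-allow
   ;; Do not omit safety checks, whatever OPTIMIZE says.
   :inhibit-safety-checking #'never-allow))

;; Install the policy for COMPILE and for COMPILE-FILE respectively.
(ccl:set-current-compiler-policy *safe-policy*)
(ccl:set-current-file-compiler-policy *safe-policy*)
```

With a policy like this installed before the application and its libraries are compiled, a stray `(declaim (optimize (speed 3) (safety 0)))` should no longer be able to switch off bounds and type checks, at the price of slower code.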

If this problem is reliably reproducible, it may be possible to debug it using GDB. (If you can find what's getting stepped on, GDB's watchpoint support can make it relatively easy to determine what's doing the stepping on ...) This can be incredibly time-consuming if you understand a lot about CCL internals and would likely be many times more so if you don't. I've done this, and it's been a very long time since the results of doing so indicated a problem in the GC.

Incidentally, if this server was "duplicating clicks" or doing other things that you've decided it's doing, it seems likely that it would do so for someone other than you.

comment:3 Changed 7 years ago by fare

  • Resolution set to invalid
  • Status changed from new to closed

OK. Thanks a whole lot for the diagnosis!

Let's close this bug for now as "invalid", since the cause is much more likely unsafe code on our side than in CCL itself. We'll come back to you if we have any evidence incriminating CCL.

(PS: I'll blame my laptop for double clicking)

On the other hand, is there any particular reason why CCL would fail to dump a core? I believe we didn't have ulimit settings against it.

comment:4 Changed 7 years ago by gb

All that CCL can do to try to force a core dump is to call abort(), which causes the process to send itself an ABORT signal (SIGABRT, however that's spelled.) If SIGABRT isn't handled, the default behavior is to generate a core dump (if ulimit settings allow that.)
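To check the ulimit side of this from inside the running image, one option is to ask a child shell, since child processes inherit the parent's resource limits. The sketch below assumes a Unix-like system with /bin/sh and uses CCL:RUN-PROGRAM; CORE-DUMP-LIMIT is a hypothetical helper name.

```lisp
;; CCL-only sketch. CORE-DUMP-LIMIT is a hypothetical helper.
(defun core-dump-limit ()
  "Return the core file size limit inherited by child processes, as a string."
  (string-trim '(#\Newline #\Space)
               (with-output-to-string (out)
                 ;; The shell inherits this process's resource limits, so
                 ;; `ulimit -c` reports the soft limit the Lisp is running under.
                 (ccl:run-program "/bin/sh" '("-c" "ulimit -c")
                                  :output out))))
```

A result of "0" would mean that core dumps are disabled by the soft limit; any other value means the limit itself is not what prevents a core file, and something else (such as a SIGABRT handler) is worth looking at.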

You seem to be running some sort of SIGABRT handler, which prints a C backtrace. I don't know where that's coming from, but that's not part of CCL.

comment:5 Changed 7 years ago by gz

  • Keywords ITA added