Ticket #1029 (closed defect: invalid)

Opened 21 months ago

Last modified 21 months ago

segfault during GC

Reported by: fare
Owned by:
Priority: normal
Milestone:
Component: Runtime (threads, GC)
Version: 1.8
Keywords: ITA
Cc:

Description

While running our code in production, we have been catching this "interesting" segfault, seemingly in the middle of garbage collection.

We strongly suspect our code was trying to print an XML message at that time using libxml2, which may or may not suggest some interaction between GC and FFI.

Xref: ITA bug 122988

Attachments

122988.2.log Download (11.4 KB) - added by fare 21 months ago.
CCL's final words
122988.log Download (11.4 KB) - added by fare 21 months ago.
CCL's final words

Change History

Changed 21 months ago by fare

CCL's final words

comment:1 Changed 21 months ago by fare

(I swear I only clicked once; was it my computer duplicating the click, or your server?)

comment:2 Changed 21 months ago by gb

About all that it's possible to say from looking at the backtrace is that this is a symptom of some sort of memory corruption (something that should be a valid lisp object looks to the GC like a vector with an implausibly large number of elements), and that this is due to:

  • a bug in CCL
  • a bug in your code
  • a bug in third party code.

It's not really possible to narrow that down further from the information provided.

You could cause this kind of heap corruption by doing something like:

(let* ((x (make-array 1 :element-type '(signed-byte 64)))
       (y (make-array 1 :element-type '(signed-byte 64))))
  (declare (optimize (speed 3) (safety 0)))
  ;; Note that this code is setting element 1 of a 1-element vector; under
  ;; less aggressive optimization settings, an error would be signaled here.
  ;; Under these optimization settings, X is likely clobbered.
  (setf (aref y 1) (dpb target::subtag-s64-vector
                        (byte 8 0)
                        -1))
  (ccl:gc))

That's likely to be very reproducible; whatever's actually causing the problem is possibly much more subtle than the example above. "Incorrect and unsafe code" is likely responsible, but I have no way of knowing what code is actually responsible.

Many CL libraries contain declamations like:

(declaim (optimize (speed 3) (safety 0)))

when they mean to say

(eval-when (:compile-toplevel)
  (declaim (optimize (speed 3) (safety 0))))

and the fact that the declamation has (unintended) load-time effects means that other code can be compiled with those unsafe settings in effect. I don't know if that's a factor in your case, but it certainly could be.
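One way to keep unsafe settings from leaking at all is to put them in a DECLARE inside the one function that needs them, since such a declaration is lexically scoped to that body. A minimal sketch, where the function names SUM-FIXNUMS and CHECKED-REF are hypothetical and not taken from any library:

```lisp
;; Hypothetical example: the unsafe policy is confined to this function body.
(defun sum-fixnums (v)
  (declare (type (simple-array fixnum (*)) v)
           (optimize (speed 3) (safety 0)))
  (let ((sum 0))
    (declare (type fixnum sum))
    (dotimes (i (length v) sum)
      (setq sum (the fixnum (+ sum (aref v i)))))))

;; Hypothetical example: this function is compiled under whatever global
;; policy is in effect, because the DECLARE above does not extend past
;; the body of SUM-FIXNUMS.
(defun checked-ref (v i)
  (aref v i))
```

Unlike a top-level DECLAIM, this form cannot affect code that is compiled or loaded afterwards.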

There are other ways (besides incorrect/unsafe Lisp code) for lisp memory to be corrupted, but that's probably the most common way and is the first thing to look at.

What code is executing when the GC runs and chokes on a corrupted object isn't necessarily relevant; anything that's run since the last time that the GC ran (and didn't choke) could be.

It is possible that the corruption that the GC is choking on is a secondary (or tertiary, or ...) effect of something less drastic that happened several GCs earlier (and that things get slightly worse each GC until they're bad enough to cause a segfault.)

The GC ordinarily assumes that the heap isn't corrupt. You can set a bit in the value of a variable that's visible to the GC:

(setq ccl::*gc-event-status-bits* 4)

to force it to do fairly rigorous integrity checks before and after it runs. (Those checks are typically much more expensive than the GC itself is, but they have a better chance of catching problems before they snowball into something that'd cause the GC to segfault.)
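Since the checks are expensive, it may be practical to enable them only around a suspect piece of code and to force a GC on either side of it, so that corruption is caught close to whatever caused it. A rough sketch, assuming the variable can be read and restored with SETQ as above; the macro name WITH-GC-INTEGRITY-CHECKS is hypothetical, and only the variable, the bit value 4, and CCL:GC come from the text above:

```lisp
;; Hypothetical helper, not part of CCL. The #+ccl guard keeps other
;; implementations from trying to read the CCL-specific symbols.
#+ccl
(defmacro with-gc-integrity-checks (&body body)
  (let ((saved (gensym "SAVED-BITS")))
    `(let ((,saved ccl::*gc-event-status-bits*))
       (unwind-protect
            (progn
              ;; Turn on the integrity-check bit (4), keeping any other bits.
              (setq ccl::*gc-event-status-bits* (logior ,saved 4))
              ;; Check the heap before the suspect code runs ...
              (ccl:gc)
              ,@body)
         ;; ... and again after it, even if BODY exits non-locally.
         (ccl:gc)
         (setq ccl::*gc-event-status-bits* ,saved)))))
```

Wrapping successively smaller parts of the application this way is one possible approach to narrowing down which code ran between a clean check and a failing one.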

Exactly how the CCL compiler is affected by OPTIMIZE settings is described (fairly well) at http://trac.clozure.com/ccl/wiki/DeclareOptimize . Compiling your application under a POLICY that disallows unsafe optimizations (regardless of optimization settings) may be easier than wrestling with ill-considered declamations.

If this problem is reliably reproducible, it may be possible to debug it using GDB. (If you can find what's getting stepped on, GDB's watchpoint support can make it relatively easy to determine what's doing the stepping on ...) This can be incredibly time-consuming if you understand a lot about CCL internals and would likely be many times more so if you don't. I've done this, and it's been a very long time since the results of doing so indicated a problem in the GC.

Incidentally, if this server was "duplicating clicks" or doing other things that you've decided it's doing, it seems likely that it would do so for someone other than you.

comment:3 Changed 21 months ago by fare

  • Status changed from new to closed
  • Resolution set to invalid

OK. Thanks a whole lot for the diagnosis!

Let's close this bug for now as "invalid", since the cause is much more likely unsafe code on our side than in CCL itself. We'll come back to you if we have any evidence incriminating CCL.

(PS: I'll blame my laptop for double clicking)

On the other hand, is there any particular reason why CCL would fail to dump a core? I believe we didn't have ulimit settings against it.

comment:4 Changed 21 months ago by gb

All that CCL can do to try to force a core dump is to call abort(), which causes the process to send itself an ABORT signal (SIGABRT, however that's spelled.) If SIGABRT isn't handled, the default behavior is to generate a core dump (if ulimit settings allow that.)

You seem to be running some sort of SIGABRT handler, which prints a C backtrace. I don't know where that's coming from, but that's not part of CCL.

comment:5 Changed 21 months ago by gz

  • Keywords ITA added