Ticket #1026 (closed defect: notabug)

Opened 18 months ago

Last modified 18 months ago

ccl crashes when testing swank-crew

Reported by: brown Owned by: gb
Priority: normal Milestone:
Component: Runtime (threads, GC) Version: 1.8
Keywords: crash Cc:

Description

I recently modified the swank-client and swank-crew packages so that they load into ccl. When I execute (asdf:test-system 'swank-crew) I end up in the Clozure CL kernel debugger. Swank-crew uses threads and locks, so it's possible the code is triggering a Lisp system bug. To reproduce, check out the latest swank-client and swank-crew from their github repositories. Get all other dependencies from quicklisp. Finally, run (asdf:test-system 'swank-crew).

Lisp version: Version 1.8-r15286M (LinuxX8664)

The ARM version of ccl on my Raspberry Pi hangs when I run the swank-crew tests. The debugger does not appear.

Change History

comment:1 Changed 18 months ago by brown

Here is a transcript of a crash.

paradicsom <83877> ccl Welcome to Clozure Common Lisp Version 1.8-r15286M (LinuxX8664)! ? (load "~/quicklisp/setup.lisp") #P"/usr/local/google/home/brown/quicklisp/setup.lisp" ? (load "swank-client/swank-client.asd") #P"/usr/local/google/home/brown/toe/open-source/swank-client/swank-client.asd" ? (load "swank-crew/swank-crew.asd") #P"/usr/local/google/home/brown/toe/open-source/swank-crew/swank-crew.asd" ? (load "swank-crew/swank-crew-test.asd") #P"/usr/local/google/home/brown/toe/open-source/swank-crew/swank-crew-test.asd" ? (asdf:test-system 'swank-crew) ; Warning: Initialized *WORKSPACE-DIRECTORY* using the path of the ASDF system called :hu.dwim.asdf to /usr/local/google/home/brown/quicklisp/dists/quicklisp/software/ (beware of the possibility that it's been symlinked somewhere else) ; While executing: #<Anonymous Function #x30200120FC7F>, in process listener(1). ;Loading #P"/usr/local/google/home/brown/.slime/fasl/2012-10-27/clozure-version_1.8-r15286m(linuxx8664)-linux-x86-64/swank-backend.lx64fsl"... ;Loading #P"/usr/local/google/home/brown/.slime/fasl/2012-10-27/clozure-version_1.8-r15286m(linuxx8664)-linux-x86-64/metering.lx64fsl"... ;Loading #P"/usr/local/google/home/brown/.slime/fasl/2012-10-27/clozure-version_1.8-r15286m(linuxx8664)-linux-x86-64/swank-ccl.lx64fsl"... ;Loading #P"/usr/local/google/home/brown/.slime/fasl/2012-10-27/clozure-version_1.8-r15286m(linuxx8664)-linux-x86-64/swank-gray.lx64fsl"... ;Loading #P"/usr/local/google/home/brown/.slime/fasl/2012-10-27/clozure-version_1.8-r15286m(linuxx8664)-linux-x86-64/swank-match.lx64fsl"... ;Loading #P"/usr/local/google/home/brown/.slime/fasl/2012-10-27/clozure-version_1.8-r15286m(linuxx8664)-linux-x86-64/swank-rpc.lx64fsl"... ;Loading #P"/usr/local/google/home/brown/.slime/fasl/2012-10-27/clozure-version_1.8-r15286m(linuxx8664)-linux-x86-64/swank.lx64fsl"... ; Warning: These Swank interfaces are unimplemented: ; (ACTIVATE-STEPPING ADD-FD-HANDLER ADD-SIGIO-HANDLER ; BACKGROUND-SAVE-IMAGE DUP EXEC-IMAGE MAKE-FD-STREAM ; REMOVE-FD-HANDLERS REMOVE-SIGIO-HANDLERS SLDB-BREAK-AT-START ; SLDB-BREAK-ON-RETURN SLDB-STEP-INTO SLDB-STEP-NEXT SLDB-STEP-OUT ; WAIT-FOR-INPUT) ; While executing: SWANK-BACKEND::WARN-UNIMPLEMENTED-INTERFACES, in process listener(1). ; Warning: No definition for CFLAG-VDSUSP ; While executing: #<Anonymous Function #x3020014CB54F>, in process listener(1). ; Warning: No definition for CFLAG-VSTATUS ; While executing: #<Anonymous Function #x3020014C950F>, in process listener(1). ; Warning: No definition for TIOCGHAYESESP ; While executing: #<Anonymous Function #x3020014C25DF>, in process listener(1). ; Warning: No definition for TIOCSHAYESESP ; While executing: #<Anonymous Function #x3020014C160F>, in process listener(1). ;; Swank started at port: 12345. ;; Swank started at port: 12346. ;; Swank started at port: 12347. ;; Swank started at port: 12348. ;; Swank started at port: 12349. ;; Swank started at port: 12350. ;; Swank started at port: 12351. ;; Swank started at port: 12352. ;; Swank started at port: 12353. ;; Swank started at port: 12354. ;; Swank started at port: 12355. . ;; swank:close-connection: Unexpected end of file on #<BASIC-TUnhandled exception 11 at 0x41cd80, context->regs at #x7f6b37df0aa8 Exception occurred while executing foreign code

at recognize_alloc_instruction + 0

received signal 11; faulting address: 0x33 address not mapped to object ? for help [12272] Clozure CL kernel debugger: B current thread: tcr = 0x7f6b37df2570, native thread ID = 0x3029, interrupts enabled

(#x00007F6B37B8DBF0) #x00003000000E32CC : #<Function FREE #x00003000000E313F> + 397 Bogus frame 7f6b37b8dc20 [12272] Clozure CL kernel debugger: T Current Thread Context Record (tcr) = 0x7f6b37df2570 Control (C) stack area: low = 0x7f6b37b9f000, high = 0x7f6b37df3000 Value (lisp) stack area: low = 0x7f6b37974000, high = 0x7f6b37b8e000 Exception stack pointer = 0x7f6b37df0f38

comment:2 Changed 18 months ago by gb

Actually use WikiFormatting here

 Here is a transcript of a crash.

 paradicsom <83877>  ccl
 Welcome to Clozure Common Lisp Version 1.8-r15286M  (LinuxX8664)!
 ? (load "~/quicklisp/setup.lisp")
 #P"/usr/local/google/home/brown/quicklisp/setup.lisp"
 ? (load "swank-client/swank-client.asd")
 #P"/usr/local/google/home/brown/toe/open-source/swank-client/swank-
 client.asd"
 ? (load "swank-crew/swank-crew.asd")
 #P"/usr/local/google/home/brown/toe/open-source/swank-crew/swank-crew.asd"
 ? (load "swank-crew/swank-crew-test.asd")
 #P"/usr/local/google/home/brown/toe/open-source/swank-crew/swank-crew-
 test.asd"
 ? (asdf:test-system 'swank-crew)
 ; Warning: Initialized *WORKSPACE-DIRECTORY* using the path of the ASDF
 system called :hu.dwim.asdf to
 /usr/local/google/home/brown/quicklisp/dists/quicklisp/software/ (beware
 of the possibility that it's been symlinked somewhere else)
 ; While executing: #<Anonymous Function #x30200120FC7F>, in process
 listener(1).
 ;Loading #P"/usr/local/google/home/brown/.slime/fasl/2012-10-27/clozure-
 version_1.8-r15286m__(linuxx8664)-linux-x86-64/swank-backend.lx64fsl"...
 ;Loading #P"/usr/local/google/home/brown/.slime/fasl/2012-10-27/clozure-
 version_1.8-r15286m__(linuxx8664)-linux-x86-64/metering.lx64fsl"...
 ;Loading #P"/usr/local/google/home/brown/.slime/fasl/2012-10-27/clozure-
 version_1.8-r15286m__(linuxx8664)-linux-x86-64/swank-ccl.lx64fsl"...
 ;Loading #P"/usr/local/google/home/brown/.slime/fasl/2012-10-27/clozure-
 version_1.8-r15286m__(linuxx8664)-linux-x86-64/swank-gray.lx64fsl"...
 ;Loading #P"/usr/local/google/home/brown/.slime/fasl/2012-10-27/clozure-
 version_1.8-r15286m__(linuxx8664)-linux-x86-64/swank-match.lx64fsl"...
 ;Loading #P"/usr/local/google/home/brown/.slime/fasl/2012-10-27/clozure-
 version_1.8-r15286m__(linuxx8664)-linux-x86-64/swank-rpc.lx64fsl"...
 ;Loading #P"/usr/local/google/home/brown/.slime/fasl/2012-10-27/clozure-
 version_1.8-r15286m__(linuxx8664)-linux-x86-64/swank.lx64fsl"...
 ; Warning: These Swank interfaces are unimplemented:
 ;           (ACTIVATE-STEPPING ADD-FD-HANDLER ADD-SIGIO-HANDLER
 ;            BACKGROUND-SAVE-IMAGE DUP EXEC-IMAGE MAKE-FD-STREAM
 ;            REMOVE-FD-HANDLERS REMOVE-SIGIO-HANDLERS SLDB-BREAK-AT-START
 ;            SLDB-BREAK-ON-RETURN SLDB-STEP-INTO SLDB-STEP-NEXT SLDB-STEP-
 OUT
 ;            WAIT-FOR-INPUT)
 ; While executing: SWANK-BACKEND::WARN-UNIMPLEMENTED-INTERFACES, in
 process listener(1).
 ; Warning: No definition for CFLAG-VDSUSP
 ; While executing: #<Anonymous Function #x3020014CB54F>, in process
 listener(1).
 ; Warning: No definition for CFLAG-VSTATUS
 ; While executing: #<Anonymous Function #x3020014C950F>, in process
 listener(1).
 ; Warning: No definition for TIOCGHAYESESP
 ; While executing: #<Anonymous Function #x3020014C25DF>, in process
 listener(1).
 ; Warning: No definition for TIOCSHAYESESP
 ; While executing: #<Anonymous Function #x3020014C160F>, in process
 listener(1).
 ;; Swank started at port: 12345.
 ;; Swank started at port: 12346.
 ;; Swank started at port: 12347.
 ;; Swank started at port: 12348.
 ;; Swank started at port: 12349.
 ;; Swank started at port: 12350.
 ;; Swank started at port: 12351.
 ;; Swank started at port: 12352.
 ;; Swank started at port: 12353.
 ;; Swank started at port: 12354.
 ;; Swank started at port: 12355.
 .
 ;; swank:close-connection: Unexpected end of file on #<BASIC-TUnhandled
 exception 11 at 0x41cd80, context->regs at #x7f6b37df0aa8
 Exception occurred while executing foreign code
  at recognize_alloc_instruction + 0
 received signal 11; faulting address: 0x33
 address not mapped to object
 ? for help
 [12272] Clozure CL kernel debugger: B
 current thread: tcr = 0x7f6b37df2570, native thread ID = 0x3029,
 interrupts enabled


 (#x00007F6B37B8DBF0) #x00003000000E32CC : #<Function FREE #x00003000000E313F> + 397
 Bogus frame 7f6b37b8dc20
 [12272] Clozure CL kernel debugger: T
 Current Thread Context Record (tcr) = 0x7f6b37df2570
 Control (C) stack area:  low = 0x7f6b37b9f000, high = 0x7f6b37df3000
 Value (lisp) stack area: low = 0x7f6b37974000, high = 0x7f6b37b8e000
 Exception stack pointer = 0x7f6b37df0f38

comment:3 Changed 18 months ago by gb

  • Owner set to gb
  • Status changed from new to assigned

I tried to reproduce this and saw the problem described above.

Subsequent attempts to reproduce that problem weren't successful, though I often saw the system deadlock. The reasons for the deadlock are fairly easy to understand, though I saw the responsible code from the bottom up and don't really know how it fits into your code. SWANK::CLEANUP-CONNECTION-THREADS calls SWANK::KILL-THREAD on a set of threads, and the definition of KILL-THREAD is as follows:

(defimplementation kill-thread (thread)
  ;;(ccl:process-kill thread) ; doesn't cut it
  (ccl::process-initial-form-exited thread :kill))

In other words, it calls an undocumented internal function to kill a thread, rather than using a documented and exported function. I have no idea what the comment ("doesn't cut it") means and suspect that I don't want to know; I'm selfishly glad that I don't use SLIME but I suspect that many people do and that they'd be affected by this.

CCL:PROCESS-KILL tries to force a thread to shut itself down in an orderly way (running UNWIND-PROTECT cleanup forms to release locks, close files, etc.)

The internal function - when called with the :kill option - just kills the underlying OS-level thread at the moment (though since it's an undocumented internal function it might disappear or do something else tomorrow.) If the thread owns a lock when it's killed this way, the lock will continue to be owned indefinitely by the (now dead) thread and any other threads that try to obtain that lock will wait forever for the dead thread to release the lock, and that obviously doesn't cut it.

This is definitely an issue, and I'd strongly recommend changing the CCL-specific implementation of KILL-THREAD to use CCL:PROCESS-KILL. I don't know with any great certainty that this problem ("abrupt" vs "orderly" thread termination) is responsible for the trip to the kernel debugger that you reported, but can imagine that it might be. (The symptom of that crash looks like "a failure to handle POSIX signals in a thread- and GC-safe way"; the abrupt thread-termination mechanism involves sending a POSIX signal to the thread being terminated, and I suspect that that signal isn't handled in a thread- and GC-safe way ...)

The hang that you saw on the ARM may very well be what I'm referring to as deadlock.

comment:4 Changed 18 months ago by gb

  • Status changed from assigned to closed
  • Resolution set to notabug

I tried your test case with KILL-THREAD redefined to use CCL:PROCESS-KILL and saw no recurrence of the apparent signal-handling problem that led to the trip to the kernel debugger.

There are all kinds of other problems - notably including a complaint that SWANK::*SEND-COUNTER* is unbound that seems to affect some swank condition handling code (which is invoked every time that the UNBOUND-VARIABLE condition is signaled ...) so some thread seems to be slowly overflowing its stack. Nothing that I see suggests that CCL's doing anything other than what it's told to do, and I've already expressed sympathy about that.

(The signal-handling issue is a real one; I'd feel a lot more urgency to fix it than I do if it wasn't related to (mis-)use of a questionable internal function.)

I'm closing this with a resolution of "not a bug", since I don't think that the fact that misuse of a questionable internal function leads to unexpected behavior should be particularly surprising to anyone. If you see some aspect of a CCL bug here, feel free to reopen this.

comment:5 Changed 18 months ago by brown

Thanks very much for taking a look at this bug report. I hadn't noticed that the Slime code was using internal CCL functions. With your suggested change to Slime's thread killing function and another fix to Slime's code by me to deal with the unbound SWANK::*SEND-COUNTER* variable, all the swank-crew tests now pass.

Thanks again for taking a look. I'm sorry I wasted your time on a report that turned out not to be a CCL problem. I'll report the Slime bugs to the author.

bob

Note: See TracTickets for help on using tickets.