Opened 10 years ago

Closed 9 years ago

#677 closed defect (fixed)

saving an application hangs

Reported by: hrapof Owned by: gb
Priority: blocker Milestone:
Component: Compiler Version: trunk
Keywords: Cc:

Description

CCL 1.5, Windows XP, x86 Given a source file danderoli.lisp,

(ccl:save-application
   "danderoli.cgi"
   :error-handler :quit :prepend-kernel t))

hangs (danderoli.cgi gets created, but is only 1731 Kb), while

(ccl:save-application
   "dand.cgi"
   :error-handler :quit :prepend-kernel t))

doesn't. The documentation doesn't seem to mention it. Is it a bug or am I missing something?

Change History (10)

comment:1 Changed 10 years ago by hrapof

  • Priority changed from minor to blocker

I was wrong :( It doesn't depend on executable name... Don't know why, but saving an application just hangs on 1731 Kb. Will try to reboot...

comment:2 Changed 10 years ago by hrapof

Reboot helped (i.e. it got saved).

comment:3 follow-up: Changed 10 years ago by lispwizard

I've seen the Windows ccl hang under XP in other circumstances, so this may not have had anything to do with saving an application per se.

I've seen intermittent hangs in the context of a compute bound loop periodically doing output (via format t). In my situation, a C followed by a :go allowed things to continue. The stack trace at the time of the C showed that it was in the middle of trying to do output under format, i.e.:

(11E9BF4) : 0 (FUNCALL #'#<#<CCL::STANDARD-KERNEL-METHOD STREAM-FRESH-LINE (CCL::CHARACTER-OUTPUT-STREAM)>> #<BASIC-CHARACTER-OUTPUT-STREAM ISO-8859-1 (PIPE/1796) #xC29152E>) 63 (11E9C00) : 1 (FUNCALL #'#<CCL::FORMAT-&> #<SYNONYM-STREAM to *TERMINAL-IO* #xC290B76> NIL NIL NIL) 199 (11E9C18) : 2 (SUB-FORMAT #<SYNONYM-STREAM to *TERMINAL-IO* #xC290B76> 1 36) 2335 (11E9C6C) : 3 (DO-SUB-FORMAT #<SYNONYM-STREAM to *TERMINAL-IO* #xC290B76>) 143

I originally saw this hang while running under Emacs where the output stream is a pipe, but then I reproduced it running under both a cygwin ksh and a native cmd.exe command loop, so I don't think it was due to any interaction with the outside world.

It doesn't happen all the time, but with a 550 file 15 minute test case (it processes each file for several seconds before writing out its line) it happens about 50% of the time. It seems to happen more frequently if the machine is busy in the background as well, e.g. doing disk-to-disk copying.

I looked at the process while it was hanging with the visual studio debugger, but since it doesn't have symbols it was fairly hard to figure out what was going on.

It seems that thread which visual studio calls _endthreadx was looping in sem_wait_forever with approximately this stack trace:

_handle_exception -> _handle_allocate_trap -> _allocate_object-> _purify_from_xp -> _gc_like_from_xp -> _suspend_other_threads -> _suspend_tcr -> _sem_wait_forever

I could force it out of this loop (by making the return code from WaitForSingleObject? not equal to 0x102 [WAIT_TIMEOUT]), and this got my code running again just as the C did before.

Hope this helps someone track this bug down. -- Kalman

comment:4 in reply to: ↑ description Changed 10 years ago by gb

  • Owner set to gb
  • Status changed from new to assigned

Replying to hrapof:

CCL 1.5, Windows XP, x86 Given a source file danderoli.lisp,

(ccl:save-application
   "danderoli.cgi"
   :error-handler :quit :prepend-kernel t))

hangs (danderoli.cgi gets created, but is only 1731 Kb), while

(ccl:save-application
   "dand.cgi"
   :error-handler :quit :prepend-kernel t))

doesn't. The documentation doesn't seem to mention it. Is it a bug or am I missing something?

I missed this; sorry for not replying earlier. Two questions (if you remember):

How big is the kernel (wx86cl.exe) that's being prepended ? (If it's ~1731 kb, that's easier to understand than if it's substantially different in size.)

Were you running SLIME ? (SLIME exacerbates a race condition: QUIT or SAVE-APPLICATION try to arrange for the initial thread to kill other threads in a hopefully orderly manner, some SLIME thread notices that it's had a PROCESS-RESET condition signaled and decides to print a message to a shared stream, the initial thread gets tired of waiting for that thread to die an orderly death and kills it abruptly, then tries to FORCE-OUTPUT on that stream and deadlocks because a lock is still held by the thread that was writing on the stream.

SAVE-APPLICATION ... :PREPEND-KERNEL T writes a copy of the kernel to the image file from lisp code, then tries to shutdown lisp threads and does the actual work of writing the heap image in the lisp kernel and in a single-threaded environment. If the kernel got written but the lisp deadlocked in a way like that described above, that'd be understandable (but the kernel exe that I have here is only about 1000 kb; it'd be harder to understand how the kernel and some of the image could get written, since there isn't much that the image-writing code would need to wait for.)

comment:5 in reply to: ↑ 3 Changed 10 years ago by gb

Replying to lispwizard:

I looked at the process while it was hanging with the visual studio debugger, but since it doesn't have symbols it was fairly hard to figure out what was going on.

CCL's kernel is built with GCC (either under Cygwin or MinGW); AFAIK, the MS debuggers barely understand the symbolic information that GCC produces (and GDB barely understands the information produced by MS tools.)

Debugging with GDB is much more likely to be productive. (I use the Cygwin GDB fairly regularly and believe that I've also used a MinGW version.)

It seems that thread which visual studio calls _endthreadx was looping in sem_wait_forever with approximately this stack trace:

_handle_exception -> _handle_allocate_trap -> _allocate_object-> _purify_from_xp -> _gc_like_from_xp -> _suspend_other_threads -> _suspend_tcr -> _sem_wait_forever

"purify" is ordinarily called by SAVE-APPLICATION; when called in that context, the lisp should be running single-threaded. It's hard to know what this has to do with the rest of your message where you describe scenarios where a compute-bound process seemingly becomes unresponsive (but responds to C.)

It may just be the case that the backtrace above is incorrect, since allocate_object doesn't call purify. allocate_object does take a similar path through 'gc_like_from_xp' and the rest of the backtrace when it decides that GC is necessary, so it may just be the case that a thread was entering the GC and purify is not in fact involved.

As the backtrace suggests, when a thread enters the GC (or something "gc-like"), it tries to suspend all other lisp threads. In some scenarios on Windows - where it appears that a thread is in the process of entering an exception handler - it's necessary to set a bit and allow the thread to resume; when the target thread enters the exception handler, it checks this bit and raises a semaphore to acknowledge the suspend request. The GC thread is waiting forever for this acknowledgement, as it should. If you interrupt that infinite wait, you might get lucky: the thread that never acknowledged the suspend request might not try to execute Lisp code while the GC is running (and believing it to be suspended.) Or maybe not.

In any situation where the GC thread would wait forever and not get the suspend acknowlegement that it's expecting, any other strategy - waiting with timeouts, sampling the state of the thread repeatedly - would have its own problems. The real underlying problem is a discrepancy between the advertised state of the thread (wrt to whether it's running Lisp or foreign code or transitioning between those states) and its actual state, and once that sort of inconsistency/discrepancy exists, you're basically hosed.

On Unix versions of CCL, exception-related state transitions aren't visible to the suspend/resume mechanism: that mechanism's signal-based, and it's the nature of signal handling and signal masks that those transitions (from lisp to an exception handler and back) happen with signals masked: a thread can't be suspended "in the middle" of such a transition. On Windows, we try to recognize and manage those transitions, and I don't know how you can get into the state you describe unless some aspect of that is handled incorrectly. I haven't seen this sort of symptom of failure to handle those transitions in a long time.

comment:6 Changed 9 years ago by rme

r13869 contains a possible fix for this (to the trunk).

Could you please update from svn and see if the problem persists?

comment:7 Changed 9 years ago by lispwizard

Sorry this took so long, but I tried reproducing my hang in the trunk (as of this morning) and failed utterly, so it looks like the hang I described was fixed. (I was still able to reproduce it in 1.5 on the same machine.)

comment:8 Changed 9 years ago by gb

The trunk changes in r13869 were propagated to 1.5 as r13898, so it would seem that either those changes didn't fix the problem or you aren't running a 1.5 version that incorporates them.

comment:9 Changed 9 years ago by lispwizard

The image I'm running which exhibits the bug reports 1.5-r13651. Perhaps that was the trunk at the time I first reported the problem? (I don't remember exactly what I downloaded when and this machine has had a disk failure in the meantime and was recovered from an older backup.)

comment:10 Changed 9 years ago by rme

  • Resolution set to fixed
  • Status changed from assigned to closed
Note: See TracTickets for help on using tickets.