Ticket #980 (closed defect: fixed)
Race condition(?) during image dump
|Reported by:||fare||Owned by:|
|Component:||Runtime (threads, GC)||Version:||1.8|
Once in a while, dumping an image using CCL yields a dysfunctional executable that crashes with low-level errors when used.
My guess, coherent with the symptoms, is that there is a race condition that causes threads to not be cleaned up properly before the image is dumped, and the runtime is subsequently confused when it resumes from such an messed up state. If correct, then a solution would then better to synchronize with threads being terminated, and/or to teach the runtime to clean up the mess when re-starting.
I've observed the symptom in the past, using CCL 1.7 while compiling QRes and probably earlier versions of CCL. But there was too much state to be reproducible, what more involving proprietary software. Today, I reproduced the bug with CCL 1.8, during the early steps of an XCVB build, therefore with very little state if at all, while testing XCVB itself, all of it free software. Attached will be the log of all XCVB testing (27MB uncompressed, .5MB bzip2'ed) - search for TYPE-ERROR near the end, and the resulting dysfunctional image (27MB uncompressed, 5MB bzip2'ed).
Ubuntu Lucid on amd64 linux-image-2.6.32-41-generic libc6-2.11.1-0ubuntu7.10 Clozure Common Lisp Version 1.8-r15359M (LinuxX8664) XCVB 0.579-5-gb4086ee
The (reformatted) commands run to build the dysfunctional image were:
ccl --no-init --quiet --batch --eval \ "(let ((x (multiple-value-bind (output warningp failurep) (let ((*default-pathname-defaults* (truename *default-pathname-defaults*))) (handler-bind (((or #+sbcl sb-c::simple-compiler-note #+ecl c::compiler-note #+ecl c::compiler-debug-note #+ecl c::compiler-warning) #'muffle-warning)) (compile-file \"/tmp/tunes/xcvb-release/xcvb/driver.lisp\" :verbose nil :print nil :output-file (merge-pathnames #P\"/tmp/tunes/xcvb-test/obj/xcvb/driver__temp.lx64fsl\")))) (if (or (not output) #-(or clisp ecl) warningp #-clisp failurep) 1 0)))) (finish-output *standard-output*) (finish-output *error-output*) (ccl:quit x))" mv /tmp/tunes/xcvb-test/obj/xcvb/driver__temp.lx64fsl \ /tmp/tunes/xcvb-test/obj/xcvb/driver.lx64fsl xcvb make-manifest --output \ /tmp/tunes/xcvb-test/obj//___initial.manifest --spec \ "((:command (:load-file (:fasl \"/xcvb/driver\")) :pathname \"/tmp/tunes/xcvb-test/obj//xcvb/driver.lx64fsl\"))" ccl --no-init --quiet --batch --load \ /tmp/tunes/xcvb-test/obj//xcvb/driver.lx64fsl --eval \ "(xcvb-driver::run (:create-image (#P\"/tmp/tunes/xcvb-test/obj/___temp.image\" :output-name \"_\" ) (:initialize-manifest \"/tmp/tunes/xcvb-test/obj//___initial.manifest\")))" mv /tmp/tunes/xcvb-test/obj/___temp.image \ /tmp/tunes/xcvb-test/obj/_.image
The next command, that triggered the error, was:
/tmp/tunes/xcvb-test/obj/_.image --no-init --quiet --batch --eval \ "(xcvb-driver::run (:compile-lisp (\"/tmp/tunes/xcvb-release/xcvb/examples/example-1/package.lisp\" #P\"/tmp/tunes/xcvb-test/obj/xcvb/example-1/package__temp.lx64fsl\")))"
The (reformatted) error output before the process exited with error code 255 was:
> Error of type TYPE-ERROR: value 0 is not of the expected type (OR CCL::NAMED-CTYPE CCL::NUMERIC-CTYPE CCL::ARRAY-CTYPE CCL::MEMBER-CTYPE CCL::CLASS-CTYPE CCL::UNION-CTYPE CCL::INTERSECTION-CTYPE CCL::CONS-CTYPE CCL::UNKNOWN-CTYPE CCL::NEGATION-CTYPE CCL::HAIRY-CTYPE CCL::FUNCTION-CTYPE). > While executing: CCL::%%TYPEP, in process listener(1). > Error of type CCL::INVALID-MEMORY-ACCESS: Fault during read of memory address #x3 > While executing: CCL::CELL-CSUBTYPEP-2, in process listener(1). > Error of type CCL::INVALID-MEMORY-ACCESS: Fault during read of memory address #x3 > While executing: CCL::CELL-CSUBTYPEP-2, in process listener(1). > Error of type CCL::INVALID-MEMORY-ACCESS: Fault during read of memory address #x3 > While executing: CCL::CELL-CSUBTYPEP-2, in process listener(1). > Error of type CCL::INVALID-MEMORY-ACCESS: Fault during read of memory address #x3 > While executing: CCL::CELL-CSUBTYPEP-2, in process listener(1).
- while I have single-threaded-ccl installed (git revision 608cd4), it wasn't used at that time. Which is consistent with the error being a threading race condition.
- while this is kernel 2.6.32, it doesn't have CONFIG_PREEMPT enabled, only CONFIG_PREEMPT_VOLUNTARY, so it is likely not the same kernel bug as was causing ITA bug 87457.
- the machine was lightly loaded with web browsing at the time I was running the test. I suspect that increasing CPU load might increase the chances at reproducing the bug.
- loading the driver fasl is a very very thin wrapper over directly calling ccl:save-application. You might be able to reproduce the bug just by calling plenty of save-applications in parallel.
- I didn't spend as much time writing this bug report the second time (see bug 979), and this time, I'm saving it to a file before to submit it.
Xref: ITA bug 106299