Ticket #980 (closed defect: fixed)

Opened 3 years ago

Last modified 3 years ago

Race condition(?) during image dump

Reported by: fare Owned by:
Priority: normal Milestone:
Component: Runtime (threads, GC) Version: 1.8
Keywords: Cc:

Description

Once in a while, dumping an image using CCL yields a dysfunctional executable that crashes with low-level errors when used.

My guess, coherent with the symptoms, is that there is a race condition that causes threads to not be cleaned up properly before the image is dumped, and the runtime is subsequently confused when it resumes from such an messed up state. If correct, then a solution would then better to synchronize with threads being terminated, and/or to teach the runtime to clean up the mess when re-starting.

I've observed the symptom in the past, using CCL 1.7 while compiling QRes and probably earlier versions of CCL. But there was too much state to be reproducible, what more involving proprietary software. Today, I reproduced the bug with CCL 1.8, during the early steps of an XCVB build, therefore with very little state if at all, while testing XCVB itself, all of it free software. Attached will be the log of all XCVB testing (27MB uncompressed, .5MB bzip2'ed) - search for TYPE-ERROR near the end, and the resulting dysfunctional image (27MB uncompressed, 5MB bzip2'ed).

Ubuntu Lucid on amd64
linux-image-2.6.32-41-generic
libc6-2.11.1-0ubuntu7.10
Clozure Common Lisp Version 1.8-r15359M  (LinuxX8664)
XCVB 0.579-5-gb4086ee

The (reformatted) commands run to build the dysfunctional image were:

ccl --no-init --quiet --batch --eval \
  "(let ((x
          (multiple-value-bind (output warningp failurep)
              (let ((*default-pathname-defaults*
                     (truename *default-pathname-defaults*)))
                (handler-bind (((or #+sbcl sb-c::simple-compiler-note
	                              #+ecl c::compiler-note
			              #+ecl c::compiler-debug-note
			              #+ecl c::compiler-warning)
				  #'muffle-warning))
                  (compile-file \"/tmp/tunes/xcvb-release/xcvb/driver.lisp\"
                    :verbose nil :print nil :output-file
		      (merge-pathnames
		       #P\"/tmp/tunes/xcvb-test/obj/xcvb/driver__temp.lx64fsl\"))))
	      (if (or (not output) #-(or clisp ecl) warningp #-clisp failurep)
	        1 0))))
     (finish-output *standard-output*) (finish-output *error-output*)
     (ccl:quit x))"

mv /tmp/tunes/xcvb-test/obj/xcvb/driver__temp.lx64fsl \
  /tmp/tunes/xcvb-test/obj/xcvb/driver.lx64fsl

xcvb make-manifest --output \
  /tmp/tunes/xcvb-test/obj//___initial.manifest --spec \
  "((:command (:load-file (:fasl \"/xcvb/driver\"))
     :pathname \"/tmp/tunes/xcvb-test/obj//xcvb/driver.lx64fsl\"))"

ccl --no-init --quiet --batch --load \
  /tmp/tunes/xcvb-test/obj//xcvb/driver.lx64fsl --eval \
  "(xcvb-driver::run
     (:create-image (#P\"/tmp/tunes/xcvb-test/obj/___temp.image\"
                    :output-name \"_\" )
       (:initialize-manifest
         \"/tmp/tunes/xcvb-test/obj//___initial.manifest\")))"

mv /tmp/tunes/xcvb-test/obj/___temp.image \
  /tmp/tunes/xcvb-test/obj/_.image

The next command, that triggered the error, was:

/tmp/tunes/xcvb-test/obj/_.image --no-init --quiet --batch --eval \
  "(xcvb-driver::run
     (:compile-lisp
       (\"/tmp/tunes/xcvb-release/xcvb/examples/example-1/package.lisp\"
        #P\"/tmp/tunes/xcvb-test/obj/xcvb/example-1/package__temp.lx64fsl\")))"

The (reformatted) error output before the process exited with error code 255 was:

> Error of type TYPE-ERROR: value 0 is not of the expected type
(OR CCL::NAMED-CTYPE CCL::NUMERIC-CTYPE CCL::ARRAY-CTYPE CCL::MEMBER-CTYPE
    CCL::CLASS-CTYPE CCL::UNION-CTYPE CCL::INTERSECTION-CTYPE CCL::CONS-CTYPE
    CCL::UNKNOWN-CTYPE CCL::NEGATION-CTYPE CCL::HAIRY-CTYPE CCL::FUNCTION-CTYPE).
> While executing: CCL::%%TYPEP, in process listener(1).
> Error of type CCL::INVALID-MEMORY-ACCESS: Fault during read of memory address #x3
> While executing: CCL::CELL-CSUBTYPEP-2, in process listener(1).
> Error of type CCL::INVALID-MEMORY-ACCESS: Fault during read of memory address #x3
> While executing: CCL::CELL-CSUBTYPEP-2, in process listener(1).
> Error of type CCL::INVALID-MEMORY-ACCESS: Fault during read of memory address #x3
> While executing: CCL::CELL-CSUBTYPEP-2, in process listener(1).
> Error of type CCL::INVALID-MEMORY-ACCESS: Fault during read of memory address #x3
> While executing: CCL::CELL-CSUBTYPEP-2, in process listener(1).

Note that:

  • while I have single-threaded-ccl installed (git revision 608cd4), it wasn't used at that time. Which is consistent with the error being a threading race condition.
  • while this is kernel 2.6.32, it doesn't have CONFIG_PREEMPT enabled, only CONFIG_PREEMPT_VOLUNTARY, so it is likely not the same kernel bug as was causing ITA bug 87457.
  • the machine was lightly loaded with web browsing at the time I was running the test. I suspect that increasing CPU load might increase the chances at reproducing the bug.
  • loading the driver fasl is a very very thin wrapper over directly calling ccl:save-application. You might be able to reproduce the bug just by calling plenty of save-applications in parallel.
  • I didn't spend as much time writing this bug report the second time (see bug 979), and this time, I'm saving it to a file before to submit it.

Xref: ITA bug 106299

Attachments

a.bz2 Download (581.1 KB) - added by rme 3 years ago.
log
_.image.bz2 Download (5.1 MB) - added by rme 3 years ago.
image

Change History

comment:1 Changed 3 years ago by fare

Well, I can't attach either the image or the log after all, since you have limit of .25MiB on attachments. I will email both of them to rme.

Changed 3 years ago by rme

log

Changed 3 years ago by rme

image

comment:2 Changed 3 years ago by rme

Increased Trac attachment limit size to 6MB; added attachments here.

comment:3 Changed 3 years ago by gb

(In [15405]) Code which updates the type cache can't allow interrupts. See ticket:980.

comment:4 Changed 3 years ago by gb

  • Status changed from new to closed
  • Resolution set to fixed

(In [15408]) Propagate r15405 to 1.8. Fixes ticket:980 in 1.8.

Note: See TracTickets for help on using tickets.