Ticket #976 (closed enhancement: fixed)

Opened 23 months ago

Last modified 21 months ago

class redefinition after snap-reader-methods causes invalid memory operation

Reported by: fare Owned by: gb
Priority: major Milestone:
Component: Compiler Version:
Keywords: ITA Cc:

Description

I haven't been able to reduce that to a small runnable example yet, but when we compile our application with (ccl::snap-reader-methods :known-sealed-world t)

If we subsequently re-define a class, even identically, some existing methods using that class will fail with an "Invalid memory operation". That's with 1.8-r15359 LinuxX8664. Before the patch from http://trac.clozure.com/ccl/changeset/14968 was used, it was a plain crash.

Can CCL be patched so that any class-redefinition after snap-reader-methods would invalidate the snapped reader methods, and possibly issue a warning or continuable error (since a promise was broken)?

Xref: ITA bug 105557

Change History

comment:1 Changed 23 months ago by fare

  • Summary changed from class redefinition after snap-reader-methods causes to class redefinition after snap-reader-methods causes invalid memory operation

comment:2 Changed 23 months ago by gb

My (possibly incomplete) understanding of this issue is as follows.

The :KNOWN-SEALED-WORLD T argument to CCL::SNAP-READER-METHODS was intended as an assertion that the universe of CLOS classes and methods was "sealed" (that no methods/classes would be introduced or redefined and that optimizations that were otherwise impossible or impractical could exploit that fact.) It wasn't intended to be used during day-to-day development but was intended to be used in a version of applications built for delivery.

Someone at ITA read a book (or something) which advised that day-to-day development should be done in an environment that was as close as possible to the delivery environment, so the QRes build process called CCL::SNAP-READER-METHODS and asserted that the world was "sealed", but that assertion was a lie (since people did day-to-day development in that image and redefined classes and methods whenever that seemed like the right thing to do.) Not surprisingly, we started getting bug reports to the effect that changing/redefining/introducing methods and classes in what was supposed to be a "sealed world" didn't always work.

So, we tried to ensure that the optimizations performed by CCL::SNAP-READER-METHODS could be undone; I don't remember whether this undoing happened incrementally or whether it was necessary to call some specific function (CCL::PESSIMIZE-CLOS ?) to get everything back in a less-aggressively-optimized, redefinition-friendly state. From time to time we've had reports of problems with this pessimization, but AFAIK these were all fixed; the fixes generally involved ensuring that generic functions that had been optimized in some way maintained enough information in their internal state to allow the optimization to be safely undone.

ITA maintained an open bug (105557, which you cite above) involving a case where this didn't work. Matt Emerson investigated this last fall and what he saw made him suspicious that some methods which performed certain types of CLOS optimization were defined inside QRes (and not in CCL) and that whatever they did wasn't undoable. When he tried to access ITA's VPN to explore further, he found that he was no longer able to access ITA's VPN. He reported this problem but never got a reply, and we later learned that Google's acquisition of ITA lead to a change of policy which prohibited offsite access.

CCL defines a generic function called CCL::OVERRIDE-ONE-METHOD-ONE-ARG-DCODE and defines a method on this GF where all arguments are specialized to T and the body does nothing. The theory that Matt wanted to explore was that QRes defined additional method(s) on this generic function and that the optimizations performed by this method or methods weren't undoable.

If

? (ccl::generic-function-methods #'CCL::OVERRIDE-ONE-METHOD-ONE-ARG-DCODE)

returns a list with more than one method, then that theory is likely correct. (Bill St. Clair said last fall that he thought it likely that he'd added such methods while working on QRes.)

If this is indeed the cause of the problem, then it seems that the next steps would involve determining whether these methods that are part of QRes need to be part of QRes (because they access proprietary code/data etc.) If not, then moving them to CCL (or to something that we can distribute/maintain) would enable us to ensure that any optimizations that these methods perform can be undone during development.) If the methods in question need to remain part of QRes, then the next steps would seemingly involve determining how we can see and modify the code in a way that complies with Google's policies.

If this is indeed a problem that involves code that we don't have access to, there isn't likely to be anything that we can do to fix it without getting that access.

comment:3 Changed 23 months ago by fare

No, we didn't write any such optimization. The symbol OVERRIDE-ONE-METHOD-ONE-ARG-DCODE does not appear anywhere in QRes, and (ccl::generic-function-methods #'CCL::OVERRIDE-ONE-METHOD-ONE-ARG-DCODE) returns a list of just one method: {{{ (#<CCL::STANDARD-KERNEL-METHOD CCL:OVERRIDE-ONE-METHOD-ONE-ARG-DCODE (T T)>) }}}

I can run any code you desire on a copy of our servers to introspect it. So far I haven't been able to reproduce in a small example, though it's totally reproducible on the overall image. I could try harder. Maybe I need to compile-file my defclass and defmethods in separate files to reproduce the effect?

comment:4 Changed 23 months ago by gb

  • Owner set to gb
  • Status changed from new to assigned

I think that I'd find it most useful to see any crash logs or backtraces associated with this; if they might contain any proprietary information, you probably don't want to post them here of course.

comment:5 Changed 22 months ago by gz

  • Keywords ITA added
  • Priority changed from normal to major

comment:6 Changed 21 months ago by fare

With gb, we double-checked that *if* we first (ccl::pessimize-clos), then the crash doesn't happen, but if we (ccl::snap-reader-methods :known-sealed-world t) it does crash.

Proposed solution: make sure that when either defclass or defmethod is called, then a- we issue a WARNING or CERROR that some redefinition violated the premise of snap-reader-methods, and that suggests how to call snap-reader-methods to reestablish optimizations. b- call (ccl::pessimize-clos)

comment:7 Changed 21 months ago by gb

  • Status changed from assigned to closed
  • Resolution set to fixed

(In [15427]) Disable CLOS optimizations before (re-)defining classes, generic functions. Seems to fix ticket:976 in the trunk.

comment:8 Changed 21 months ago by gb

(In [15430]) Propagate r15427, r15428 to 1.8 branch. Fixes ticket:976 in 1.8.

Note: See TracTickets for help on using tickets.