Ticket #993 (closed defect: fixed)

Opened 2 years ago

Last modified 22 months ago

lock-free hashtable error

Reported by: uchida
Owned by: gz
Priority: normal
Milestone:
Component: other
Version: 1.8
Keywords:
Cc:

Description

On rare occasions, gethash returns nil despite the existence of the entry.

toor@toor-VirtualBox:~/src/lisp$ ~/ccl/lx86cl -n
Welcome to Clozure Common Lisp Version 1.8-r15286M  (LinuxX8632)!
? (load "lock-free-hash-table-test.lisp")

#<EXTERNAL-PROCESS (/bin/echo)[9149] (RUNNING) #x1839788E> 
#<EXTERNAL-PROCESS (/bin/echo)[9152] (RUNNING) #x1835FF26> 
#<EXTERNAL-PROCESS (/bin/echo)[9154] (RUNNING) #x1841F11E> 
...
#<EXTERNAL-PROCESS (/bin/echo)[15259] (RUNNING) #x1D41E8E6> 
#<EXTERNAL-PROCESS (/bin/echo)[15261] (RUNNING) #x1D4D0536> 
#<EXTERNAL-PROCESS (/bin/echo)[15264] (RUNNING) #x1D74E746> 
> Error: value NIL is not of the expected type NUMBER.
> While executing: CCL::+-2, in process listener(1).
> Type :POP to abort, :R for a list of available restarts.
> Type :? for other options.
1 > 

Attachments

lock-free-hash-table-test.lisp (1.3 KB) - added by uchida 2 years ago.
lock-free-hash-table-test-2.lisp (1.7 KB) - added by uchida 2 years ago.
lock-free-hash-table-test-3.lisp (628 bytes) - added by uchida 2 years ago.
lock-free-hash-table-test-4.lisp (477 bytes) - added by uchida 2 years ago.
x86-constants32.h.diff (277 bytes) - added by uchida 2 years ago.
x86-exceptions.c.diff (765 bytes) - added by uchida 2 years ago.
l0-hash.lisp.diff (865 bytes) - added by uchida 2 years ago.
build.sh (274 bytes) - added by uchida 2 years ago.
debug993.c (176 bytes) - added by uchida 2 years ago.
debug993.lisp (865 bytes) - added by uchida 2 years ago.
run.sh (281 bytes) - added by uchida 2 years ago.
l0-hash.lisp.2.diff (956 bytes) - added by uchida 2 years ago.
lock-free-hash-table-test-simple.lisp (228 bytes) - added by uchida 23 months ago.

Change History

Changed 2 years ago by uchida

comment:1 Changed 2 years ago by uchida

Reproducible in dx86cl, dx86cl64, lx86cl and lx86cl64.

comment:2 Changed 2 years ago by gb

Also reproducible on ARM, which is a bit surprising.

Changed 2 years ago by uchida

comment:3 follow-up: ↓ 4 Changed 2 years ago by uchida

Another test program. With it, the error is now also reproducible on wx86cl.exe (Windows XP Pro, 32-bit). On Darwin, it typically takes less than 10 seconds to reproduce the error, even without the #'clrhash-thread.

comment:4 in reply to: ↑ 3 Changed 2 years ago by uchida

(932 . 89)
(932 . 90)
(932 . 91)
(932 . 92)
(932 . 93)
(932 . 94)
> Error: value NIL is not of the expected type NUMBER.
> While executing: CCL::+-2, in process listener(1).
> Type :GO to continue, :POP to abort, :R for a list of available restarts.
> If continued: Skip loading "lock-free-hash-table-test-2.lisp"
> Type :? for other options.
1 > (lisp-implementation-version)
"Version 1.8-r15286M  (WindowsX8632)"
1 >

Changed 2 years ago by uchida

comment:5 Changed 2 years ago by uchida

This is an even shorter version that reproduces a lock-free-related error on dx86cl, lx86cl, and wx86cl.exe.

(34 . 50)
(34 . 51)
(34 . 52)
(34 . 53)
(34 . 54)
(34 . 55)
> Error: Stack overflow on value stack.
> While executing: CCL::LOCK-FREE-REHASH, in process listener(1).
> Type :GO to continue, :POP to abort, :R for a list of available restarts.
> If continued: Delete one of the entries.
> Type :? for other options.
1 > (lisp-implementation-version)
"Version 1.8-r15286M  (WindowsX8632)"
1 >

Changed 2 years ago by uchida

comment:6 Changed 2 years ago by uchida

I managed to remove the use of run-program.

Changed 2 years ago by uchida

comment:7 Changed 2 years ago by uchida

All the test cases above run to completion without error when GC is prevented in other threads while ccl::set-hash-key-conditional is executing, combined with a small change in ccl::%lock-free-rehash-in-place.

To build, put all the attached files and ccl-1.8-darwinx86.tar.gz in the same directory and run:

$ ./build.sh

To run the tests:

$ ./run.sh

comment:8 Changed 2 years ago by gb

  • Status changed from new to closed
  • Resolution set to fixed

(In [15444]) Because of pc_luser_xp()'s expectations/constraints, don't modify %temp0 in .SPset_hash_key_conditional. (This is similar to r15433, and doesn't fix ticket:993 either, but may have caused another problem.)

comment:9 Changed 2 years ago by gb

  • Status changed from closed to reopened
  • Resolution fixed deleted

The comment in r15444 said that ticket:993 was NOT fixed by that change.

comment:10 Changed 2 years ago by gb

(In [15447]) Register usage in x8632 version of .SPstore_node_conditional. (See ticket:993, which has exposed this and similar problems but isn't primarily attributable to them.)

comment:11 Changed 2 years ago by uchida

On x8664, it typically takes less than 10 seconds to reproduce the error when the chance of a GC between set-hash-key-conditional and set-hash-value-conditional in lock-free-puthash is increased:

uchita-akiko-no-macbook-2:lispbox-0.7 uchidamasako$ ./ccl/dx86cl64 -n
Welcome to Clozure Common Lisp Version 1.8-r15286M  (DarwinX8664)!
? (advise ccl::store-gvector-conditional
	(ccl:process-allow-schedule)
	:when :before)
#<Compiled-function (CCL::ADVISED 'CCL::STORE-GVECTOR-CONDITIONAL) (Non-Global)  #x3020006F657F>
? (load "lock-free-hash-table-test.lisp")

#<EXTERNAL-PROCESS (/bin/echo)[43660] (RUNNING) #x30200072CC6D> 
#<EXTERNAL-PROCESS (/bin/echo)[43661] (RUNNING) #x30200072B3CD> 
#<EXTERNAL-PROCESS (/bin/echo)[43662] (RUNNING) #x3020006B774D> 
#<EXTERNAL-PROCESS (/bin/echo)[43663] (EXITED : 0) #x3020009FF60D> 
#<EXTERNAL-PROCESS (/bin/echo)[43664] (RUNNING) #x302000C1F62D> 
...
#<EXTERNAL-PROCESS (/bin/echo)[43749] (EXITED : 0) #x302004A2090D> 
#<EXTERNAL-PROCESS (/bin/echo)[43750] (RUNNING) #x302004BA090D> 
> Error: value NIL is not of the expected type NUMBER.
> While executing: CCL::+-2, in process listener(1).
> Type :POP to abort, :R for a list of available restarts.
> Type :? for other options.
1 > 

comment:12 Changed 2 years ago by uchida

My guess is that between set-hash-key-conditional and set-hash-value-conditional in lock-free-puthash, the partially inserted entry is in the DELETED1 state. If a GC happens between set-hash-key-conditional and set-hash-value-conditional, and remhash or clrhash has set nhash.vector.deleted-count to 1 since the last GC, mark_root (in x86-gc.c) deletes the partially inserted entry.
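The guess above can be illustrated with a toy model. This is a minimal Python sketch, not CCL code: the names `ModelTable`, `FREE`, `gc_sweep`, and `demo` are hypothetical, slots are chosen by `key % size` with no collision handling, and the "collector" simply reclaims any slot that has a key but no value whenever a deletion has been recorded since the last sweep. It only shows how a two-step insert could lose its entry under those assumptions.

```python
FREE = object()  # hypothetical "nothing here" marker; not a CCL symbol


class ModelTable:
    """Toy table: integer key k lives in slot k % size (no collisions handled)."""

    def __init__(self, size=8):
        self.keys = [FREE] * size
        self.values = [FREE] * size
        self.deleted_count = 0  # stands in for nhash.vector.deleted-count

    def slot(self, key):
        return key % len(self.keys)

    def set_key(self, key):
        # Step 1 of an insert: publish the key only.
        i = self.slot(key)
        self.keys[i] = key
        return i

    def set_value(self, i, value):
        # Step 2 of an insert: store the value in the slot chosen in step 1.
        self.values[i] = value

    def remove(self, key):
        # Mark the entry as deleted and record that a deletion happened.
        i = self.slot(key)
        if self.keys[i] is not FREE and self.keys[i] == key:
            self.values[i] = FREE
            self.deleted_count += 1

    def get(self, key):
        i = self.slot(key)
        if self.keys[i] is FREE or self.keys[i] != key:
            return None
        v = self.values[i]
        return None if v is FREE else v

    def gc_sweep(self):
        # Assumption of this model: the sweep runs only if a deletion was
        # recorded, and it reclaims every slot with a key but no value.
        # A half-inserted entry looks exactly like that.
        if self.deleted_count == 0:
            return
        for i in range(len(self.keys)):
            if self.keys[i] is not FREE and self.values[i] is FREE:
                self.keys[i] = FREE
        self.deleted_count = 0


def insert(table, key, value, between=None):
    """Two-step insert; `between` simulates other activity between the steps."""
    i = table.set_key(key)
    if between is not None:
        between()
    table.set_value(i, value)


def demo(with_race):
    t = ModelTable()
    insert(t, 1, "a")

    def other_thread_then_gc():
        t.remove(1)    # another thread deletes an unrelated entry
        t.gc_sweep()   # then a GC runs before our value is stored

    insert(t, 2, "b", between=other_thread_then_gc if with_race else None)
    return t.get(2)
```

In this model `demo(False)` returns `"b"`, while `demo(True)` returns `None`: the value is written into a slot whose key has already been reclaimed, which matches the reported symptom of gethash returning nil for an entry that was just stored.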

Changed 2 years ago by uchida

comment:13 Changed 2 years ago by uchida

I am trying to encode the partially inserted state by reversing the order of the updates to the hash entry, as in l0-hash.lisp.2.diff. I'm not sure whether this is right.
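The idea of reversing the update order can be sketched in a toy model. This is an assumption-laden Python illustration, not the content of l0-hash.lisp.2.diff: the marker `FREE` and the functions `gc_sweep`, `lookup`, and `insert_with_gc_between` are hypothetical, and the model collector reclaims only slots that have a key but no value. Whether the same reasoning holds in the real implementation is exactly what the comment above is unsure about.

```python
FREE = object()  # hypothetical "empty" marker; not a CCL symbol


def gc_sweep(keys, values):
    """Model collector: reclaim slots that have a key but no value yet."""
    for i in range(len(keys)):
        if keys[i] is not FREE and values[i] is FREE:
            keys[i] = FREE


def lookup(keys, values, key):
    """Return the stored value, or None if the entry is absent or incomplete."""
    i = key % len(keys)
    if keys[i] is FREE or keys[i] != key or values[i] is FREE:
        return None
    return values[i]


def insert_with_gc_between(key, value, value_first):
    """Insert in two steps with a model GC between them, then look the key up."""
    keys, values = [FREE] * 8, [FREE] * 8
    i = key % len(keys)
    if value_first:
        values[i] = value       # reversed order: value is stored first
        gc_sweep(keys, values)  # slot has no key yet, so the sweep skips it
        keys[i] = key
    else:
        keys[i] = key           # original order: key is stored first
        gc_sweep(keys, values)  # slot looks deleted, so the key is reclaimed
        values[i] = value
    return lookup(keys, values, key)
```

In this model, `insert_with_gc_between(3, "v", value_first=False)` returns `None`, while `insert_with_gc_between(3, "v", value_first=True)` returns `"v"`, because a slot with a value but no key does not match the pattern the model collector reclaims.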

comment:14 Changed 2 years ago by gb

  • Owner set to gb
  • Status changed from reopened to new

comment:15 Changed 2 years ago by gb

  • Status changed from new to assigned

Changed 23 months ago by uchida

comment:16 Changed 23 months ago by uchida

Here is an even simpler test program to reproduce the error.

comment:17 Changed 22 months ago by gz

  • Owner changed from gb to gz
  • Status changed from assigned to new

comment:18 Changed 22 months ago by rme

r15504 in the trunk should fix this.

comment:19 Changed 22 months ago by uchida

Thanks! Unfortunately, lock-free-hash-table-test-3.lisp shows a different problem. The l0-hash.lisp.diff attached above seems to reduce the likelihood of this error.

uchita-akiko-no-macbook-2:lispbox-0.7 uchidamasako$ ./ccl-trunk/ccl/dx86cl64 -n
Welcome to Clozure Common Lisp Version 1.9-dev-r15508  (DarwinX8664)!
? (load "lock-free-hash-table-test-3.lisp")

(0 . 0) 
(0 . 1) 
(0 . 2) 
(0 . 3) 
(0 . 4) 
(0 . 5) 
...
(575 . 35) 
(575 . 36) 
(575 . 37) 
(575 . 38) 
(575 . 39) 
(575 . 40) 
(575 . 41) 
> Error: Stack overflow on value stack.
> While executing: CCL::LOCK-FREE-REHASH, in process listener(1).
> Type :GO to continue, :POP to abort, :R for a list of available restarts.
> If continued: Delete one of the entries.
> Type :? for other options.
1 > 

comment:20 Changed 22 months ago by gz

See also ticket:717.

comment:21 Changed 22 months ago by gz

  • Status changed from new to closed
  • Resolution set to fixed

(In [15525]) In lock-free-puthash handle a new key getting relocated just before it's added to the table. Fix %lock-free-rehash-in-place to set both key and value in a free slot. This fixes ticket:993.
