Skip to main content

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.

Jump to: navigation, search

EclipseLink/Bugs/259993

< EclipseLink‎ | Bugs
Revision as of 10:11, 21 January 2009 by Michael.obrien.oracle.com (Talk | contribs) (Reproduction)

Bug Analysis Document: 259993: Deadlock on em.find() on WebSphere 7.0.0.1

Document History

Date Author Version Description & Notes
20080115 Michael O'Brien 1.0 Initial reproduction use cases

Overview

This bug describes the behavior and fix for an isssue where an em.find() on WebSphere 7.0.0.1 hangs indefinitly when attempting to acquire lock on the concurrencyManager on a cacheKey.

We have two (2) deadlock issues here in package org.eclipse.persistence.testing.tests.jpa.relationship.

  • RelationshipModelJUnitTestSuite.testModifyItem()
  • VirtualAttributeTestSuite.testInsertVirtualAttribute()

Exceptions

There is no exception until the tests are stopped or the server is restarted. We then get a deadlock timeout exception. If this initial deadlock is avoided in code we experience a secondary deadlock in ConcurrencyManager.

Timeout during UnitOfWorkIdentityMapAccessor.getAndCloneCacheKeyFromParent()

[1/15/09 17:48:15:936 EST] 0000000e ThreadMonitor W   WSVR0605W: Thread "ORB.thread.pool : 1" (00000022) has been active for 670586 milliseconds and may be hung.  There is/are 2 thread(s) in total in the server that may be hung.
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:196)
	at org.eclipse.persistence.internal.sessions.UnitOfWorkIdentityMapAccessor.getAndCloneCacheKeyFromParent(UnitOfWorkIdentityMapAccessor.java:146)

After adding code to break out of the wait() loop. We probably get this serialization error because even though both our test client (JUNit) and the EJB 3.0 jar model on the server reside on the same machine - they live in different JVM processes.

    [junit]     org.omg.CORBA.MARSHAL: Unable to read value from underlying
bridge : Mismatched serialization UIDs : Source (RepId
RMI:junit.framework.AssertionFailedError:2E921E9DEB124A7F:0000000000000001) 
= 0000000000000001 whereas Target (RepId
RMI:junit.framework.AssertionFailedError:048A3675A6FD5463:A02206A4C2A2FD62)
= A02206A4C2A2FD62  vmcid: IBM  minor code: 896  completed: No)

Timeout during ConcurrencyManager.releaseDeferredLock()

1/19/09 16:17:33:719 EST] 00000019 TimeoutManage I   WTRN0006W: Transaction
0000011EF0BF93D7000000010000000F3FC544039A0DB96F61B909BB432079E1295E672A0000011EF0BF93D7000000010000000F3FC544039A0DB96F61B909BB432079E1295E672A00000001
has timed out after 120 seconds.
[1/19/09 16:17:33:719 EST] 00000019 TimeoutManage I   WTRN0124I: When the timeout occurred the thread with which the transaction is, or was most recently, associated was Thread[ORB.thread.pool : 0,5,main]. The stack trace of this thread when the timeout occurred was: 
	java.lang.Thread.sleep(Native Method)
	java.lang.Thread.sleep(Thread.java:850)
	org.eclipse.persistence.internal.helper.ConcurrencyManager.releaseDeferredLock(ConcurrencyManager.java:429)

StackTrace

Query: DataModifyQuery(sql="CREATE TABLE CMP3_ENTITYD_SEQ (SEQ_NAME VARCHAR2(50) NOT NULL, SEQ_COUNT NUMBER(38) NULL, PRIMARY KEY (SEQ_NAME))")
[1/15/09 17:48:15:936 EST] 0000000e ThreadMonitor W   WSVR0605W: Thread "ORB.thread.pool : 1" (00000022) has been active for 670586 milliseconds and may be hung.  There is/are 2 thread(s) in total in the server that may be hung.
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:196)
	at org.eclipse.persistence.internal.sessions.UnitOfWorkIdentityMapAccessor.getAndCloneCacheKeyFromParent(UnitOfWorkIdentityMapAccessor.java:146)
	at org.eclipse.persistence.internal.sessions.UnitOfWorkIdentityMapAccessor.getFromIdentityMap(UnitOfWorkIdentityMapAccessor.java:110)
	at org.eclipse.persistence.internal.queries.ExpressionQueryMechanism.checkCacheForObject(ExpressionQueryMechanism.java:730)
	at org.eclipse.persistence.queries.ReadObjectQuery.checkEarlyReturnImpl(ReadObjectQuery.java:229)
	at org.eclipse.persistence.queries.ObjectLevelReadQuery.checkEarlyReturn(ObjectLevelReadQuery.java:627)
	at org.eclipse.persistence.queries.DatabaseQuery.execute(DatabaseQuery.java:613)
	at org.eclipse.persistence.queries.ObjectLevelReadQuery.execute(ObjectLevelReadQuery.java:891)
	at org.eclipse.persistence.queries.ReadObjectQuery.execute(ReadObjectQuery.java:397)
	at org.eclipse.persistence.queries.ObjectLevelReadQuery.executeInUnitOfWork(ObjectLevelReadQuery.java:954)
	at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.internalExecuteQuery(UnitOfWorkImpl.java:2651)
	at org.eclipse.persistence.internal.sessions.AbstractSession.executeQuery(AbstractSession.java:1181)
	at org.eclipse.persistence.internal.sessions.AbstractSession.executeQuery(AbstractSession.java:1165)
	at org.eclipse.persistence.internal.sessions.AbstractSession.executeQuery(AbstractSession.java:1125)
	at org.eclipse.persistence.internal.jpa.EntityManagerImpl.executeQuery(EntityManagerImpl.java:571)
	at org.eclipse.persistence.internal.jpa.EntityManagerImpl.findInternal(EntityManagerImpl.java:516)
	at org.eclipse.persistence.internal.jpa.EntityManagerImpl.find(EntityManagerImpl.java:431)
	at org.eclipse.persistence.internal.jpa.EntityManagerImpl.find(EntityManagerImpl.java:339)
	at com.ibm.ws.jpa.management.JPATxEmInvocation.find(JPATxEmInvocation.java:211)
	at com.ibm.ws.jpa.management.JPAEntityManager.find(JPAEntityManager.java:175)
	at org.eclipse.persistence.testing.tests.jpa.relationships.VirtualAttributeTestSuite.testReadVirtualAttribute(VirtualAttributeTestSuite.java:76)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
	at java.lang.reflect.Method.invoke(Method.java:599)
	at junit.framework.TestCase.runTest(TestCase.java:154)
	at junit.framework.TestCase.runBare(TestCase.java:127)
	at org.eclipse.persistence.testing.framework.junit.JUnitTestCase.runBareServer(JUnitTestCase.java:391)
	at org.eclipse.persistence.testing.framework.server.TestRunnerBean.runTest(TestRunnerBean.java:87)
	at org.eclipse.persistence.testing.framework.server.EJSRemote0SLTestRunner_ed643271.runTest(EJSRemote0SLTestRunner_ed643271.java)
	at org.eclipse.persistence.testing.framework.server._EJSRemote0SLTestRunner_ed643271_Tie.runTest(_EJSRemote0SLTestRunner_ed643271_Tie.java)
	at org.eclipse.persistence.testing.framework.server._EJSRemote0SLTestRunner_ed643271_Tie._invoke(_EJSRemote0SLTestRunner_ed643271_Tie.java)
	at com.ibm.CORBA.iiop.ServerDelegate.dispatchInvokeHandler(ServerDelegate.java:622)
	at com.ibm.CORBA.iiop.ServerDelegate.dispatch(ServerDelegate.java:475)
	at com.ibm.rmi.iiop.ORB.process(ORB.java:504)
	at com.ibm.CORBA.iiop.ORB.process(ORB.java:1571)
	at com.ibm.rmi.iiop.Connection.respondTo(Connection.java:2771)
	at com.ibm.rmi.iiop.Connection.doWork(Connection.java:2640)
	at com.ibm.rmi.iiop.WorkUnitImpl.doWork(WorkUnitImpl.java:63)
	at com.ibm.ejs.oa.pool.PooledThread.run(ThreadPool.java:118)
	at com.ibm.ws.util.ThreadPool$Worker.run(ThreadPool.java:1527)

Concepts

JPA Specification Notes

JPA 1.0 Specification

JPA 2.0 Specification

Reproduction

We need to narrow down in what cases this deadlock occurs and in what configuration.

WebSphere Behavior

We have the following use-case points to verify. Installation with/without admin security, installation as a service/non-service, running in debug/non-debug,

Note: It helps that I have 2 WebSphere 7 servers running because the first install was with security settings on which causes a UI security popup on all config/deploy/stop actions with a 30 sec timeout.

  • With admin security =
  • Without admin security =
  • Install as a service =
  • Install as a non-service =
  • Run server in non-debug mode =
  • Run server in debug mode =
    • With Eclipse debug client attached
    • Without Eclipse debug client attached

entityManager.find() Behaviour

  • em.find() after a previous insert
  • em.find() in another update


Debugging EclipseLink in WebSphere 7 from Eclipse

See procedure on setting up WebSphere 7 so that we can debug the JUnit server tests from the Eclipse IDE by attaching to the 7781 JDWP debug port of a debug mode WebSphere server instance.

http://wiki.eclipse.org/EclipseLink/Examples/JPA/WebSphere7_Web_Tutorial#Remote_Debugging_EclipseLink_in_WebSphere_7.0_from_Eclipse_3.4

Prerequisites

Data Model

Use Case 1:

StackTrace

Logs

Analysis

Before debugging this issue it looks like the following code is never returning from the while loop because either the lock was not released or the cached object is still null.

package org.eclipse.persistence.internal.sessions;
public class UnitOfWorkIdentityMapAccessor extends IdentityMapAccessor {
    /**
     * INTERNAL:
     * This method will return the object from the parent and clone it.
     */
    protected Object getAndCloneCacheKeyFromParent(Vector primaryKey, Class theClass, boolean shouldReturnInvalidatedObjects, ClassDescriptor descriptor) {
        // Note: Objects returned from the parent's identity map should include invalidated
        // objects. This is important because this internal method is used in the existence
        // check in the UnitOfWork.
        UnitOfWorkImpl unitOfWork = (UnitOfWorkImpl)getSession();
        org.eclipse.persistence.internal.sessions.IdentityMapAccessor parentIdentityMapAccessor = unitOfWork.getParent().getIdentityMapAccessorInstance();
        CacheKey cacheKey = parentIdentityMapAccessor.getCacheKeyForObject(primaryKey, theClass, descriptor);
        if ((cacheKey == null) && unitOfWork.getParent().isUnitOfWork()) {
            //for nested unit of work
            //make parent clone and register object
            ((UnitOfWorkIdentityMapAccessor)parentIdentityMapAccessor).getAndCloneCacheKeyFromParent(primaryKey, theClass, shouldReturnInvalidatedObjects, descriptor);
            //get the cachekey that was created in the parent.
            cacheKey = parentIdentityMapAccessor.getCacheKeyForObject(primaryKey, theClass, descriptor);
        }
 
        Object objectFromCache = null;
        // this check could be simplfied to one line but would create a window
        // in which GC could remove the object and we would end up with a null pointer
        // as well we must inspect the cacheKey without locking on it.
        if ((cacheKey != null) && (shouldReturnInvalidatedObjects || !descriptor.getCacheInvalidationPolicy().isInvalidated(cacheKey))) {
            synchronized (cacheKey.getMutex()) {
                //if the object in the cachekey is null but the key is acquired then
                //someone must be rebuilding it or creating a new one.  Sleep until
                // it's finished. A plain wait here would be more efficient but we may not
                // get notified for quite some time (ie deadlock) if the other thread
                //is building the object.  Must wait and not sleep in order for the monitor to be released
                objectFromCache = cacheKey.getObject();
                try {
                    while (cacheKey.isAcquired() && (objectFromCache == null)) {
// WebSphere 7.0.0.1 hangs indefinetly here
                        cacheKey.getMutex().wait(5); // in milliseconds
                    }
                } catch (InterruptedException ex) {
                }
                if (objectFromCache == null) {
                    return null;
                }
            }
        } else {
            return null;
        }
 
        // Consider read-only class CR#4094
        if (getSession().isClassReadOnly(theClass, descriptor)) {
            // PERF: Just return the original object.
            return objectFromCache;
        }
 
        if(getSession() instanceof RepeatableWriteUnitOfWork ) {
            Object unregisteredDeletedClone = ((RepeatableWriteUnitOfWork)getSession()).getUnregisteredDeletedCloneForOriginal(objectFromCache);
            if(unregisteredDeletedClone != null) {
                return unregisteredDeletedClone;
            }
        }
 
        return unitOfWork.cloneAndRegisterObject(objectFromCache, cacheKey, descriptor);
    }

Constraints

Concurrency and Thread Safety

Design / Functionality

Alternative 1: Short Fix: Do Not Wait Indefinitely

This change will affect all platforms. However it does not fix the core of the problem where we are getting a mismatched serialization UID for the failing - previously hung tests.

The fix essentially involves limiting the number of wait cycles when holding for the mutex on the key to release - as opposed to waiting for the mutex forever.

Part 1: UnitOfWorkIdentityMapAccessor

Source
              private static final int MAXIMUM_MUTEX_STILL_ACQUIRED_WAIT_ITERATIONS = 2000;
              private static final int MUTEX_STILL_ACQUIRED_WAIT_MILLISECONDS_TIME = 5;
 
               objectFromCache = cacheKey.getObject();
                try {
--->                int iterations = 0;
                    while (cacheKey.isAcquired() && (objectFromCache == null) 
--->                    && (iterations++ < MAXIMUM_MUTEX_STILL_ACQUIRED_WAIT_ITERATIONS)) {
                        cacheKey.getMutex().wait(MUTEX_STILL_ACQUIRED_WAIT_MILLISECONDS_TIME);
                    }
                } catch (InterruptedException ex) {
                  // we should log a finest message here
                }

Part 2: ConcurrencyManager

The alt #1 workaround allows the tests to continue and emit the base server exception below:

    [junit]     org.omg.CORBA.MARSHAL: Unable to read value from underlying
bridge : Mismatched serialization UIDs : Source (RepId
RMI:junit.framework.AssertionFailedError:2E921E9DEB124A7F:0000000000000001) 
= 0000000000000001 whereas Target (RepId
RMI:junit.framework.AssertionFailedError:048A3675A6FD5463:A02206A4C2A2FD62)
= A02206A4C2A2FD62  vmcid: IBM  minor code: 896  completed: No)

There are (2) two nested deadlocks in our code. The second deadlock occurs periodically after the first has been fixed during testModifyItem() in ConcurrencyManager.releasesDeferredLock()

Stacktrace
1/19/09 16:17:33:719 EST] 00000019 TimeoutManage I   WTRN0006W: Transaction
0000011EF0BF93D7000000010000000F3FC544039A0DB96F61B909BB432079E1295E672A0000011EF0BF93D7000000010000000F3FC544039A0DB96F61B909BB432079E1295E672A00000001
has timed out after 120 seconds.
[1/19/09 16:17:33:719 EST] 00000019 TimeoutManage I   WTRN0124I: When the timeout occurred the thread with which the transaction is, or was most recently, associated was Thread[ORB.thread.pool : 0,5,main]. The stack trace of this thread when the timeout occurred was: 
	java.lang.Thread.sleep(Native Method)
	java.lang.Thread.sleep(Thread.java:850)
	org.eclipse.persistence.internal.helper.ConcurrencyManager.releaseDeferredLock(ConcurrencyManager.java:429)
	org.eclipse.persistence.internal.helper.ConcurrencyManager.checkDeferredLock(ConcurrencyManager.java:193)
	org.eclipse.persistence.internal.identitymaps.CacheKey.checkDeferredLock(CacheKey.java:176)
	org.eclipse.persistence.internal.identitymaps.IdentityMapManager.getFromIdentityMapWithDeferredLock(IdentityMapManager.java:686)
	org.eclipse.persistence.internal.sessions.IdentityMapAccessor.getFromIdentityMapWithDeferredLock(IdentityMapAccessor.java:424)
	org.eclipse.persistence.internal.queries.ExpressionQueryMechanism.checkCacheForObject(ExpressionQueryMechanism.java:728)
	org.eclipse.persistence.queries.ReadObjectQuery.checkEarlyReturnImpl(ReadObjectQuery.java:229)
	org.eclipse.persistence.queries.ObjectLevelReadQuery.checkEarlyReturn(ObjectLevelReadQuery.java:620)
	org.eclipse.persistence.queries.ObjectLevelReadQuery.checkEarlyReturn(ObjectLevelReadQuery.java:644)
	org.eclipse.persistence.queries.DatabaseQuery.execute(DatabaseQuery.java:613)
	org.eclipse.persistence.queries.ObjectLevelReadQuery.execute(ObjectLevelReadQuery.java:891)
	org.eclipse.persistence.queries.ReadObjectQuery.execute(ReadObjectQuery.java:397)
	org.eclipse.persistence.queries.ObjectLevelReadQuery.executeInUnitOfWork(ObjectLevelReadQuery.java:954)
	org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.internalExecuteQuery(UnitOfWorkImpl.java:2651)
	org.eclipse.persistence.internal.sessions.AbstractSession.executeQuery(AbstractSession.java:1181)
	org.eclipse.persistence.internal.sessions.AbstractSession.executeQuery(AbstractSession.java:1165)
	org.eclipse.persistence.internal.sessions.AbstractSession.executeQuery(AbstractSession.java:1125)
	org.eclipse.persistence.internal.jpa.EntityManagerImpl.executeQuery(EntityManagerImpl.java:571)
	org.eclipse.persistence.internal.jpa.EntityManagerImpl.findInternal(EntityManagerImpl.java:516)
	org.eclipse.persistence.internal.jpa.EntityManagerImpl.find(EntityManagerImpl.java:431)
	org.eclipse.persistence.internal.jpa.EntityManagerImpl.find(EntityManagerImpl.java:339)
	com.ibm.ws.jpa.management.JPATxEmInvocation.find(JPATxEmInvocation.java:211)
	com.ibm.ws.jpa.management.JPAEntityManager.find(JPAEntityManager.java:175)
	org.eclipse.persistence.testing.tests.jpa.relationships.RelationshipModelJUnitTestSuite.testModifyItem(RelationshipModelJUnitTestSuite.java:119)
	sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
	sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
	java.lang.reflect.Method.invoke(Method.java:599)
	junit.framework.TestCase.runTest(TestCase.java:154)
	junit.framework.TestCase.runBare(TestCase.java:127)
	org.eclipse.persistence.testing.framework.junit.JUnitTestCase.runBareServer(JUnitTestCase.java:391)
	org.eclipse.persistence.testing.framework.server.TestRunnerBean.runTest(TestRunnerBean.java:87)
	org.eclipse.persistence.testing.framework.server.EJSRemote0SLTestRunner_ed643271.runTest(EJSRemote0SLTestRunner_ed643271.java)
	org.eclipse.persistence.testing.framework.server._EJSRemote0SLTestRunner_ed643271_Tie.runTest(_EJSRemote0SLTestRunner_ed643271_Tie.java)
	org.eclipse.persistence.testing.framework.server._EJSRemote0SLTestRunner_ed643271_Tie._invoke(_EJSRemote0SLTestRunner_ed643271_Tie.java)
	com.ibm.CORBA.iiop.ServerDelegate.dispatchInvokeHandler(ServerDelegate.java:622)
        while (true) {
            // 2612538 - the default size of Map (32) is appropriate
            Map recursiveSet = new IdentityHashMap();
            if (isBuildObjectOnThreadComplete(currentThread, recursiveSet)) {// Thread job done.
                lockManager.releaseActiveLocksOnThread();
                removeDeferredLockManager(currentThread);
                AbstractSessionLog.getLog().log(SessionLog.FINER, "deferred_locks_released", currentThread.getName());
                return;
            } else {// Not done yet, wait and check again.
                try {
---->                Thread.sleep(1);
                } catch (InterruptedException ignoreAndContinue) {
                }
            }
        }

Logs

server-run-test:
     [echo] Running test eclipselink-relationships-model on websphere
Warning: Reference compile.server.path has not been set at runtime, but was found during
build file parsing, attempting to resolve. Future versions of Ant may support
 referencing ids defined in non-executed targets.
Warning: Reference compile.path has not been set at runtime, but was found during
build file parsing, attempting to resolve. Future versions of Ant may support
 referencing ids defined in non-executed targets.
    [junit] Running org.eclipse.persistence.testing.tests.jpa.relationships.ServerTestSuite
    [junit]     org.omg.CORBA.MARSHAL: Unable to read value from underlying bridge : Mismatched serialization UIDs : Source (RepId RMI:junit.framework.AssertionFailedError:2E921E9DEB124A7F:0000000000000001) = 0000000000000001 whereas Target
 (RepId RMI:junit.framework.AssertionFailedError:048A3675A6FD5463:A02206A4C2A2FD62) = A02206A4C2A2FD62  vmcid: IBM  minor code: 896  completed: No)
    [junit]     org.omg.CORBA.MARSHAL: Unable to read value from underlying bridge : Mismatched serialization UIDs : Source (RepId RMI:junit.framework.AssertionFailedError:2E921E9DEB124A7F:0000000000000001) = 0000000000000001 whereas Target
 (RepId RMI:junit.framework.AssertionFailedError:048A3675A6FD5463:A02206A4C2A2FD62) = A02206A4C2A2FD62  vmcid: IBM  minor code: 896  completed: No)
    [junit] Tests run: 30, Failures: 0, Errors: 2, Time elapsed: 69.799 sec

BUILD SUCCESSFUL
Total time: 3 minutes 44 seconds

Alternative 2: Long Fix: WebSphere 7.0 Platform Change for Alternative 1

This change will target the change in alternative 1 to WebSphere 7 only. I don't think we should limit these two deadlock checks to a single application server's Platform class.

Source

Alternative 3: Full Fix:

What is the cause of the deadlock for the following two specific test cases on WebSphere 7?

package org.eclipse.persistence.testing.tests.jpa.relationship;
 
RelationshipModelJUnitTestSuite.testModifyItem()
   Item item = em.find(Item.class, itemId);
VirtualAttributeTestSuite.testInsertVirtualAttribute()


Source

 

Decision

Pending code-review.

Logs

Implementation

Logging

Testing

Full regression tests on another JEE server like WebLogic will be required. Full regression tests on an SE (servlet only) server like Tomcat will be required.

API

GUI

Config files

Documentation

Open Issues

Issue # Owner Description / Notes
I1 mobrien - If we go with alt#1 - we need a number for the maximum timeout wait (1/10/30 seconds)
I2 mobrien -

Decisions

Issue # Description / Notes Decision

Future Considerations

During the research for this bug, the following items were identified as out of scope but are captured here as potential future enhancements. If agreed upon during the review process these should be logged in the bug system.

References

Back to the top