PySide Bugzilla Closed for New Bugs

PySide is now a Qt Add-on and uses the Qt Project's JIRA Bug Tracker instead of this Bugzilla instance. This Bugzilla is left for reference purposes.

Bug 1110 - Concurrency error causes GC heap corruption
: Concurrency error causes GC heap corruption
Status: RESOLVED FIXED
Product: PySide
Classification: Unclassified
Component: PySide
: HEAD
: All All
: P2 normal
Assigned To: Hugo Parente Lima
:
:
:
  Show dependency treegraph
 
Reported: 2012-01-04 01:41 EET by Jason McCampbell
Modified: 2012-02-28 00:46 EET (History)
10 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jason McCampbell 2012-01-04 01:41:57 EET
We are seeing random hangs and crashes at startup about 5% of the time (PySide
1.0.9, Python 2.7.2).  There are two symptoms that appear to be related.

One is that the application hangs in the GC module in update_refs. The issue is
that gc->gc.gc_next points to itself.  What is happening is that a prior GC
object becomes corrupted and points to the head of generation 0, even though no
objects are in that list.  In the gdb transcript below you can see gc_next ==
gc_prev == gc == 0x16f3e0 == &generations[0].  Note that generation 1 is being
collected but the list points back to generation 0 head, which causes the
infinite loop.

The second manifestation is a crash with a similar stack trace, always
occurring during a collection step.  

After instrumenting Python's GC module to catch concurrency errors it seems
that 
both of these issues occur because DynamicSlotDataV2::callback() and
GlobalReceiverV2::qt_metacall() don't acquire the GIL before calling into
Python.

In our case we have the main thread running while a QWebView instance is
running additional threads and making calls into the Python interpreter.  It is
the QWebView thread that isn't acquiring the GIL.

I am testing a patch and will submit it tomorrow once I'm reasonably confident
that it doesn't break anything else.

Jason

gdb transcript:

Reading symbols for shared libraries + done
Reading symbols for shared libraries + done
Reading symbols for shared libraries + done
Reading symbols for shared libraries + done
^C
Program received signal SIGINT, Interrupt.
update_refs (containers=0x16f400) at Modules/gcmodule.c:291
291        for (; gc != containers; gc = gc->gc.gc_next) {
(gdb) where
#0  update_refs (containers=0x16f400) at Modules/gcmodule.c:291
#1  0x0010d0e8 in collect (generation=1) at Modules/gcmodule.c:874
#2  0x0010d3e2 in collect_generations () at Modules/gcmodule.c:997
#3  0x0010e067 in _PyObject_GC_Malloc (basicsize=28) at Modules/gcmodule.c:1513
#4  0x0007135f in PyType_GenericAlloc (type=0x159140, nitems=0) at
Objects/typeobject.c:753
#5  0x000804ac in weakref___new__ (type=0x159140, args=0x1a4e24c0, kwargs=0x0)
at Objects/weakrefobject.c:300
#6  0x0007120a in type_call (type=0x159140, args=0x1a4e24c0, kwds=0x0) at
Objects/typeobject.c:721
#7  0x0000d4a4 in PyObject_Call (func=0x159140, arg=0x1a4e24c0, kw=0x0) at
Objects/abstract.c:2529
#8  0x000ce34b in do_call (func=0x159140, pp_stack=0xbfffa310, na=1, nk=0) at
Python/ceval.c:4231
...
#131 0x000f278f in PyRun_SimpleFileExFlags (fp=0xa0be08e0, filename=0xbffff816
"/Users/jmccampbell/Library/EPD/7.2-i386/bin/canopy", closeit=1,
flags=0xbffff638) at Python/pythonrun.c:936
#132 0x000f204c in PyRun_AnyFileExFlags (fp=0xa0be08e0, filename=0xbffff816
"/Users/jmccampbell/Library/EPD/7.2-i386/bin/canopy", closeit=1,
flags=0xbffff638) at Python/pythonrun.c:740
#133 0x0010c2cd in Py_Main (argc=3, argv=0xbffff71c) at Modules/main.c:599
#134 0x000024a6 in main (argc=3, argv=0xbffff71c) at ./Modules/python.c:23
(gdb) p *gc
$1 = {
  gc = {
    gc_next = 0x16f3e0, 
    gc_prev = 0x16f3e0, 
    gc_refs = 700, 
    gc_magic = 0
  }, 
  dummy = <invalid float value>
}
Comment 1 Hugo Parente Lima 2012-01-04 16:08:45 EET
Thanks for going deep trying to find a solution for this bug, a patch will be
very welcome.

You can run all unit test to check if your patch breaks anything, besides
creating another test to check if your patch fixes this bug, we will test your
patch against our build bots anyway to check if it breaks PySide on any other
supported platform.
Comment 2 Jason McCampbell 2012-01-04 20:05:31 EET
I submitted a pull request on GitHub
(https://github.com/PySide/PySide/pull/109) that fixes this specific issue and
what appears to be some other cases where the GIL is not acquired.

I ran our regression tests and used our application to verify that it doesn't
break anything.  I did not run the PySide unit tests as I couldn't find any
docs on them and was seeing several segmentation violations with and without my
changes.  I'm not sure if there is anything specific that needs to be done to
run them.
Comment 3 David Butler 2012-01-05 06:38:18 EET
*** Bug 1099 has been marked as a duplicate of this bug. ***
Comment 4 Paulo Alcantara 2012-01-05 22:48:34 EET
Hi Jason,

Is there any way you could paste/attach your code that reproduces this bug ?
(That would presumably make our life easier)

I've tested your branch and it seems not to break anything else. Excellent! :-)
Comment 5 Hugo Parente Lima 2012-01-05 23:04:23 EET
In other words, have you some PySide code that crashes without your patches? If
so we can transform it into a unit test and avoid future regressions, besides
make sure the patches fix anything.
Comment 6 Jason McCampbell 2012-01-05 23:41:39 EET
Hi Hugo, unfortunately I haven't been able to find a way to reproduce it
efficiently.  What I was seeing is that if I started the application one out of
10 or 20 tries it would crash during startup. 

Here is what I can tell you... the issue occurred at startup because our main
window includes a QWebView widget that was making a call to
PySide::GlobalReceiverV2::qt_metacall on a thread other than the main Python
thread.  See the stack trace pasted below.  qt_metacall eventually calls
PyMethod_New and wasn't acquiring the GIL so once in a while it would conflict
with another thread holding the GIL and corrupt the heap.

I found the issue by modifying the Python GC module and adding a lock check
that causes an abort if another thread already has the lock.  I would actually
like to see a compile option in Python that turns on checks for all appropriate
API functions verifying that the GIL is actually held by that thread.

Short of that, the best option I can think of is spawn two threads and have one
of them call one of the functions in this stack over-and-over looking for
thread timing issues.  But that sort of test is rather unsatisfying.

Jason

(gdb) where
#0  PySide::GlobalReceiverV2::qt_metacall (this=0x1da98130,
call=QMetaObject::InvokeMetaMethod, id=5, args=0xbfff98e8) at
/Users/jmccampbell/packages/PySide/pyside-qt4/libpyside/globalreceiverv2.cpp:292
#1  0x02e75b81 in QMetaObject::activate ()
#2  0x1e8551d4 in QWebPage::loadFinished ()
#3  0x1e85fe1e in QWebPage::qt_metacall ()
#4  0x02e75b81 in QMetaObject::activate ()
#5  0x1e83c124 in WebCore::FrameLoaderClientQt::loadFinished ()
#6  0x1e83cd89 in
WebCore::FrameLoaderClientQt::postProgressFinishedNotification ()
#7  0x1e61331b in WebCore::ProgressTracker::finalProgressComplete ()
#8  0x1e6133e8 in WebCore::ProgressTracker::progressCompleted ()
#9  0x1e5e013b in WebCore::FrameLoader::clearProvisionalLoad ()
#10 0x1e5e1efa in WebCore::FrameLoader::checkLoadCompleteForThisFrame ()
#11 0x1e5e2236 in WebCore::FrameLoader::recursiveCheckLoadComplete ()
#12 0x1e5e48c7 in WebCore::FrameLoader::receivedMainResourceError ()
#13 0x1e605b3e in WebCore::MainResourceLoader::didCancel ()
Comment 7 Hugo Parente Lima 2012-02-28 00:46:52 EET
Fix finally merged in commit:

pyside/4e1be8c119c7611e1fdce08f670d63d673f04e0f