BUG/MEDIUM: httpclient/lua: fix a race between lua GC and hlua_ctx_destroy
In bb581423b ("BUG/MEDIUM: httpclient/lua: crash when the lua task timeout
before the httpclient"), a new logic was implemented to make sure that
when a lua ctx destroyed, related httpclients are correctly destroyed too
to prevent a such httpclients from being resuscitated on a destroyed lua ctx.
This was implemented by adding a list of httpclients within the lua ctx,
and a new function, hlua_httpclient_destroy_all(), that is called under
hlua_ctx_destroy() and runs through the httpclients list in the lua context
to properly terminate them.
This was done with the assumption that no concurrent Lua garbage collection
cycles could occur on the same ressources, which seems OK since the "lua"
context is about to be freed and is not explicitly being used by other threads.
But when 'lua-load' is used, the main lua stack is shared between multiple
OS threads, which means that all lua ctx in the process are linked to the
same parent stack.
Yet it seems that lua GC, which can be triggered automatically under
lua_resume() or manually through lua_gc(), does not limit itself to the
"coroutine" stack (the stack referenced in lua->T) when performing the cleanup,
but is able to perform some cleanup on the main stack plus coroutines stacks
that were created under the same main stack (via lua_newthread()) as well.
This can be explained by the fact that lua_newthread() coroutines are not meant
to be thread-safe by design.
Source: http://lua-users.org/lists/lua-l/2011-07/msg00072.html (lua co-author)
It did not cause other issues so far because most of the time when using
'lua-load', the global lua lock is taken when performing critical operations
that are known to interfere with the main stack.
But here in hlua_httpclient_destroy_all(), we don't run under the global lock.
Now that we properly understand the issue, the fix is pretty trivial:
We could simply guard the hlua_httpclient_destroy_all() under the global
lua lock, this would work but it could increase the contention over the
global lock.
Instead, we switched 'lua->hc_list' which was introduced with bb581423b
from simple list to mt_list so that concurrent accesses between
hlua_httpclient_destroy_all and hlua_httpclient_gc() are properly handled.
The issue was reported by @Mark11122 on Github #2037.
This must be backported with bb581423b ("BUG/MEDIUM: httpclient/lua: crash
when the lua task timeout before the httpclient") as far as 2.5.
diff --git a/src/hlua.c b/src/hlua.c
index 184c1e7..3c19680 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -1248,7 +1248,7 @@
lua->wake_time = TICK_ETERNITY;
lua->state_id = state_id;
LIST_INIT(&lua->com);
- LIST_INIT(&lua->hc_list);
+ MT_LIST_INIT(&lua->hc_list);
if (!already_safe) {
if (!SET_SAFE_LJMP_PARENT(lua)) {
lua->Tref = LUA_REFNIL;
@@ -1270,16 +1270,30 @@
return 1;
}
-/* kill all associated httpclient to this hlua task */
+/* kill all associated httpclient to this hlua task
+ * We must take extra precautions as we're manipulating lua-exposed
+ * objects without the main lua lock.
+ */
static void hlua_httpclient_destroy_all(struct hlua *hlua)
{
- struct hlua_httpclient *hlua_hc, *back;
+ struct hlua_httpclient *hlua_hc;
- list_for_each_entry_safe(hlua_hc, back, &hlua->hc_list, by_hlua) {
- if (hlua_hc->hc)
- httpclient_stop_and_destroy(hlua_hc->hc);
+ /* use thread-safe accessors for hc_list since GC cycle initiated by
+ * another thread sharing the same main lua stack (lua coroutine)
+ * could execute hlua_httpclient_gc() on the hlua->hc_list items
+ * in parallel: Lua GC applies on the main stack, it is not limited to
+ * a single coroutine stack, see Github issue #2037 for reference.
+ * Remember, coroutines created using lua_newthread() are not meant to
+ * be thread safe in Lua. (From lua co-author:
+ * http://lua-users.org/lists/lua-l/2011-07/msg00072.html)
+ *
+ * This security measure is superfluous when 'lua-load-per-thread' is used
+ * since in this case coroutines exclusively run on the same thread
+ * (main stack is not shared between OS threads).
+ */
+ while ((hlua_hc = MT_LIST_POP(&hlua->hc_list, typeof(hlua_hc), by_hlua))) {
+ httpclient_stop_and_destroy(hlua_hc->hc);
hlua_hc->hc = NULL;
- LIST_DEL_INIT(&hlua_hc->by_hlua);
}
}
@@ -7052,11 +7066,11 @@
hlua_hc = MAY_LJMP(hlua_checkhttpclient(L, 1));
- if (hlua_hc->hc)
+ if (MT_LIST_DELETE(&hlua_hc->by_hlua)) {
+ /* we won the race against hlua_httpclient_destroy_all() */
httpclient_stop_and_destroy(hlua_hc->hc);
-
- hlua_hc->hc = NULL;
- LIST_DEL_INIT(&hlua_hc->by_hlua);
+ hlua_hc->hc = NULL;
+ }
return 0;
}
@@ -7087,7 +7101,7 @@
if (!hlua_hc->hc)
goto err;
- LIST_APPEND(&hlua->hc_list, &hlua_hc->by_hlua);
+ MT_LIST_APPEND(&hlua->hc_list, &hlua_hc->by_hlua);
/* Pop a class stream metatable and affect it to the userdata. */
lua_rawgeti(L, LUA_REGISTRYINDEX, class_httpclient_ref);