(partially) Fix #91 using a simple alarm signal (SIGALRM) on execution timeout

Introduce a timer (5 seconds in this case) to ensure that the
tokenizer spends no more than 5 seconds on a single paste. This
uses a simple, standard POSIX signal handler (SIGALRM).
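
A minimal sketch of the alarm-based timeout described above, written for
Python 3 on a POSIX system. The helper name run_with_timeout is hypothetical
and not part of this commit; unlike the committed code, it cancels the alarm
in a finally clause so that a pending alarm is cleared on every exit path.

```python
import signal
import time


class TimeoutException(Exception):
    """Raised by the SIGALRM handler when the time budget is exhausted."""


def timeout_handler(signum, frame):
    raise TimeoutException


# Install the handler once. SIGALRM exists only on POSIX systems, and
# handlers can only be installed from the main thread.
signal.signal(signal.SIGALRM, timeout_handler)


def run_with_timeout(func, seconds):
    """Hypothetical helper: run func() and return (finished, result)."""
    signal.alarm(seconds)      # ask the kernel to send SIGALRM later
    try:
        return True, func()
    except TimeoutException:
        return False, None
    finally:
        signal.alarm(0)        # cancel any alarm that is still pending


fast = run_with_timeout(lambda: sum(range(10)), 1)   # finishes in time
slow = run_with_timeout(lambda: time.sleep(3), 1)    # interrupted after 1s
```

Here the sleeping call stands in for a tokenization that takes too long; the
handler raises inside it and the caller regains control.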

This approach fixes the specific issue we currently have with
some inputs whose tokenization takes too long. The fix should be
improved and made more generic:

 - Introducing statistics on content that times out.
 - Keeping a list/queue to further process those files using a different
   tokenizer approach. Maybe a set of "dirty" processes to handle the edge cases
   and to not impact the overall processing and analysis.
 - Making the timer configurable per module (at least for this one).
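
The follow-up ideas listed above could be sketched roughly as below
(Python 3, POSIX only). Every name here (MODULE_TIMEOUTS, DEFAULT_TIMEOUT,
timeout_stats, retry_queue, process, toy_worker) is an assumption made for
illustration; none of them exists in this commit or in the project.

```python
import signal
import time
from collections import Counter, deque

# Hypothetical per-module time budgets, in seconds.
MODULE_TIMEOUTS = {"Tokenize": 1}
DEFAULT_TIMEOUT = 5

timeout_stats = Counter()   # module name -> number of timeouts seen
retry_queue = deque()       # (module, item) pairs left for a fallback pass


class TimeoutException(Exception):
    pass


def timeout_handler(signum, frame):
    raise TimeoutException


signal.signal(signal.SIGALRM, timeout_handler)


def process(module, item, worker):
    """Run worker(item) within the module's budget and record timeouts."""
    signal.alarm(MODULE_TIMEOUTS.get(module, DEFAULT_TIMEOUT))
    try:
        return worker(item)
    except TimeoutException:
        timeout_stats[module] += 1
        retry_queue.append((module, item))
        return None
    finally:
        signal.alarm(0)     # always clear a pending alarm


def toy_worker(item):
    # Stand-in for the tokenizer; "slow" simulates a pathological input.
    if item == "slow":
        time.sleep(3)
    return item.upper()


results = [process("Tokenize", i, toy_worker) for i in ("a", "slow", "b")]
```

Items left in retry_queue could then be handed to a separate, slower
process so that they do not delay the main processing loop.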
Alexandre Dulaunoy 2017-01-12 07:32:55 +00:00
parent 1950a2dc0e
commit 3b101ea8f5

@@ -28,6 +28,15 @@ from packages import Paste
 from pubsublogger import publisher
 from Helper import Process
+import signal
+class TimeoutException(Exception):
+    pass
+def timeout_handler(signum, frame):
+    raise TimeoutException
+signal.signal(signal.SIGALRM, timeout_handler)
 if __name__ == "__main__":
     publisher.port = 6380
@@ -44,10 +53,17 @@ if __name__ == "__main__":
         print message
         if message is not None:
             paste = Paste.Paste(message)
-            for word, score in paste._get_top_words().items():
-                if len(word) >= 4:
-                    msg = '{} {} {}'.format(paste.p_path, word, score)
-                    p.populate_set_out(msg)
+            signal.alarm(5)
+            try:
+                for word, score in paste._get_top_words().items():
+                    if len(word) >= 4:
+                        msg = '{} {} {}'.format(paste.p_path, word, score)
+                        p.populate_set_out(msg)
+            except TimeoutException:
+                print ("{0} processing timeout".format(paste.p_path))
+                continue
+            else:
+                signal.alarm(0)
         else:
             publisher.debug("Tokeniser is idling 10s")
             time.sleep(10)