hash_tagging: Undefined behavior within Python threading can cause incomplete processing
Description of problem:
In the hash_plugins analyzer, threads are not instantiated in the same process as the one in which they are executed. This pattern causes undefined behavior of the is_alive function, depending on the OS and Python version (see the snippet below).
Therefore, in hash_tagging plugins, if the analysis queue is empty but the analyzer still has work to do, the analyzer will be killed, resulting in a partially executed task (for example, when the plaso database is really small, the analysis is not executed).
In my opinion, the best option is to instantiate the thread class just before it starts (https://github.com/log2timeline/plaso/blob/f6a18bcaad24d0a3de2c029a5e341901bb1ccb59/plaso/analysis/hash_tagging.py#L265-L267) rather than in __init__ (https://github.com/log2timeline/plaso/blob/f6a18bcaad24d0a3de2c029a5e341901bb1ccb59/plaso/analysis/hash_tagging.py#L248). The main disadvantage is that the TestConnection function called in the cli.helper.* classes would have to call a class or static method.
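To illustrate the proposal, here is a minimal sketch of deferring thread creation until start. It is not plaso's actual code: the class name DeferredThreadAnalyzer, the Start/IsAlive/Join methods and the trivial TestConnection body are all hypothetical and only show the shape of the change.

```python
import threading
import time


class DeferredThreadAnalyzer:
    """Hypothetical analyzer that creates its thread only when started."""

    def __init__(self, work_items):
        self._work_items = list(work_items)
        # No Thread object is created here, because __init__ may run in a
        # different process than the one that later executes the work.
        self._thread = None
        self.processed = []

    @classmethod
    def TestConnection(cls):
        """Hypothetical check that needs neither an instance nor a thread."""
        return True

    def _Run(self):
        # Stand-in for the real lookup work.
        for item in self._work_items:
            time.sleep(0.01)
            self.processed.append(item)

    def Start(self):
        # The thread is instantiated in the same process that runs it.
        self._thread = threading.Thread(target=self._Run, daemon=True)
        self._thread.start()

    def IsAlive(self):
        return self._thread is not None and self._thread.is_alive()

    def Join(self):
        if self._thread is not None:
            self._thread.join()
```

With this layout a caller can run DeferredThreadAnalyzer.TestConnection() without constructing a thread, which is the trade-off mentioned above.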
Further description of the undefined behavior:
For example, the following snippet (from https://stackoverflow.com/questions/57814933/is-alive-always-returns-false-when-called-on-a-thread-from-inside-multiprocess) gives different results depending on the OS:
- Ubuntu 20.04/Python 3.8 -> worker.is_alive() always returns False.
- macOS (tested even though it is not supported by plaso)/Python 3.9 -> TypeError: cannot pickle '_thread.lock' object
- CentOS/Python 3.6 -> worker.is_alive() works as expected.
from threading import Thread
from multiprocessing import Process
import time

class WorkerThread(Thread):
    def run(self):
        i = 0
        while i < 10:
            print("worker running..")
            time.sleep(1)
            i += 1

class ProcessClass:
    def run_worker(self, worker):
        self.worker = worker
        self.worker.daemon = True
        self.worker.start()
        i = 0
        while i < 12:
            print(f"Is worker thread alive? {self.worker.is_alive()}")
            i += 1
            time.sleep(1)

worker = WorkerThread()
processclass = ProcessClass()
parentProcess = Process(target = processclass.run_worker, args = (worker,))
parentProcess.start()
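For comparison, the snippet above can be reshaped so that the thread is created inside the child process instead of being passed in from the parent. This is only a sketch under assumptions: the function names worker_loop, run_worker and observe_liveness are made up for illustration, and the start_method parameter exists only so the example can be pinned to a specific multiprocessing start method (for instance "fork" on Linux).

```python
import multiprocessing
import threading
import time


def worker_loop():
    """Stand-in for the work the analyzer thread would do."""
    time.sleep(0.5)


def run_worker(results):
    """Runs in the child process and creates the thread there."""
    worker = threading.Thread(target=worker_loop, daemon=True)
    worker.start()
    # Checked while the thread is still sleeping.
    results.put(worker.is_alive())
    worker.join()
    # Checked after the thread has finished.
    results.put(worker.is_alive())


def observe_liveness(start_method=None):
    """Returns the two is_alive() values reported by the child process.

    start_method=None uses the platform's default start method.
    """
    context = multiprocessing.get_context(start_method)
    results = context.Queue()
    child = context.Process(target=run_worker, args=(results,))
    child.start()
    observed = [results.get(timeout=10), results.get(timeout=10)]
    child.join()
    return observed


if __name__ == "__main__":
    print(observe_liveness())
```

Because no Thread object crosses the process boundary, nothing has to be pickled, and is_alive() is queried in the same process that created and started the thread.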
Command line and arguments:
psort.py --analysis nsrlsvr test.plaso
Plaso version:
- 20211229
Operating system Plaso is running on:
- Ubuntu 20.04
Installation method:
- installed from [GiFT PPA](https://launchpad.net/~gift) stable track
- Manually
Top GitHub Comments
As long as the server API and what the end-to-end tests use stay the same, and no incompatible licenses are introduced, the solution should be fine.
Why?