hash_tagging: Undefined behavior within Python threading can cause incomplete processing

Description of problem:

In the hash_plugins analyzer, threads are not instantiated in the same process as the one in which they are executed. This pattern leads to undefined behavior of the is_alive function, depending on the OS and Python version (see the snippet below).

Therefore, in hash_tagging plugins, if the analysis queue is empty but the analyzer still has work to do, the analyzer will be killed, resulting in a partially executed task (for example, when the plaso database is really small, the analysis is not executed).

In my opinion, the best option is to instantiate the thread class just before it starts (https://github.com/log2timeline/plaso/blob/f6a18bcaad24d0a3de2c029a5e341901bb1ccb59/plaso/analysis/hash_tagging.py#L265-L267) rather than in __init__ (https://github.com/log2timeline/plaso/blob/f6a18bcaad24d0a3de2c029a5e341901bb1ccb59/plaso/analysis/hash_tagging.py#L248). The main disadvantage is that the TestConnection function called in the cli.helper.* classes would have to call a class or static method.
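
To illustrate the proposal, here is a minimal sketch of deferred thread creation. The class DeferredThreadAnalyzer, its methods, and the upper-casing "work" are hypothetical stand-ins and not plaso's actual code; the only point is that __init__ creates no Thread object and that the connection check needs no instance.

```python
import threading


class DeferredThreadAnalyzer:
    """Hypothetical stand-in for a hash analyzer; not plaso's actual class."""

    def __init__(self, hashes):
        self._hashes = list(hashes)
        self.results = []
        # No Thread object is created here, so none exists before the
        # analyzer reaches the process that will actually run it.
        self._thread = None

    @staticmethod
    def test_connection():
        """Placeholder connectivity check that needs no instance state."""
        return True

    def _run(self):
        # Placeholder work: a real analyzer would look up each hash.
        for digest in self._hashes:
            self.results.append(digest.upper())

    def start(self):
        # The thread is instantiated in the process that executes it.
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def is_alive(self):
        return self._thread is not None and self._thread.is_alive()

    def join(self):
        if self._thread is not None:
            self._thread.join()
```

With this shape, a caller in the worker process would call start() and then poll is_alive(), while a CLI helper could call DeferredThreadAnalyzer.test_connection() without constructing a thread at all.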

Further description of the undefined behavior:

For example, the following snippet (from https://stackoverflow.com/questions/57814933/is-alive-always-returns-false-when-called-on-a-thread-from-inside-multiprocess) gives different results depending on the OS:

  • Ubuntu 20.04/Python 3.8 -> worker.is_alive() always returns False.
  • macOS (tested even though it is not supported by plaso)/Python 3.9 -> TypeError: cannot pickle '_thread.lock' object
  • CentOS/Python 3.6 -> worker.is_alive() works as expected.

from multiprocessing import Process
from threading import Thread
import time


class WorkerThread(Thread):
    def run(self):
        # Simulate roughly ten seconds of work.
        i = 0
        while i < 10:
            print("worker running..")
            time.sleep(1)
            i += 1


class ProcessClass:
    def run_worker(self, worker):
        # The thread object was created in the parent process, but it is
        # started and polled here, in the child process.
        self.worker = worker
        self.worker.daemon = True
        self.worker.start()
        i = 0
        while i < 12:
            print(f"Is worker thread alive? {self.worker.is_alive()}")
            i += 1
            time.sleep(1)


worker = WorkerThread()
processclass = ProcessClass()
parentProcess = Process(target=processclass.run_worker, args=(worker,))
parentProcess.start()
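
For comparison, here is a hedged variant of the same snippet in which the thread is created inside the child process instead of being passed in from the parent. The function name run_worker and the results queue are illustrative choices, and the timings are shortened; this is a sketch of the pattern, not the fix that was applied in plaso.

```python
import time
from multiprocessing import Process, Queue
from threading import Thread


class WorkerThread(Thread):
    def run(self):
        # Simulate a short unit of work.
        time.sleep(0.3)


def run_worker(results):
    # The thread is created in the same process that starts and polls it,
    # so no Thread object has to cross the process boundary.
    worker = WorkerThread(daemon=True)
    worker.start()
    results.put(worker.is_alive())  # sampled while the work is in progress
    worker.join()
    results.put(worker.is_alive())  # sampled after the work has finished


if __name__ == "__main__":
    results = Queue()
    child = Process(target=run_worker, args=(results,))
    child.start()
    print("alive while running:", results.get())
    print("alive after join:", results.get())
    child.join()
```

Because only the queue is handed to the child, nothing containing a _thread.lock owned by a Thread needs to be pickled, and is_alive() is evaluated in the process that owns the thread.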

Command line and arguments:

psort.py --analysis nsrlsvr test.plaso

Plaso version:

  • 20211229

Operating system Plaso is running on:

  • Ubuntu 20.04

Installation method:

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

joachimmetz commented, Jun 8, 2022 (1 reaction)

As long as the server API and what the end-to-end tests use stay the same, and no incompatible licenses are introduced, the solution should be fine.

joachimmetz commented, Aug 23, 2022 (0 reactions)

I suppose we can close it.

Why?
