What are the advantages of gathering facts beforehand?

Is your feature request related to a problem? Please describe

I have been modifying a local version of pyinfra to solve various issues (some of them posted as GitHub issues here), and I have come to the conclusion that there are not many advantages to gathering facts before operations. In fact, most of the limitations and odd behaviors I see would be solved by gathering facts along the way.

I would be interested in discussing the pros and cons, and in contributing to changing the way pyinfra works if necessary.

Describe the solution you’d like

Here is a short list of the advantages of gathering facts right before the operations that require them, rather than up front:

  • Since there is no need to come up with a list of facts to gather beforehand, the whole problem of ordering operations goes away. In particular, the code that browses the stack to come up with the operation order becomes unnecessary, and this code does not output the correct order in several cases.
  • There is no need for the preserve_loop_order magic anymore. It was anyway very counter-intuitive to have operations in loops not executed in the expected order.
  • Most usage of assume_present arguments becomes unnecessary since facts will reflect the correct state of the machine right before an operation is performed.
  • The execution flow is easier to understand. For example, simple things like creating a directory and then checking whether that directory exists would simply work.
  • The whole concept of nested operations is moot or becomes very limited. It seems the main use case was to run an operation, get the output and perform more operations based on the output. But with facts gathered along the way, the output of the command is available immediately and conditional logic can be written directly within the main python script (instead of the callback).
  • There is no need to support dynamic facts, since facts are dynamic by default.
  • We can still operate a cache to make sure facts are not gathered unnecessarily and, just like Ansible, provide ways to invalidate the cache for arbitrary operations like server.shell.
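
The last point can be sketched as a per-host fact cache with explicit invalidation. This is only an illustration under stated assumptions: FactCache, gather_directory_exists and the simulated state dictionary are hypothetical names, not part of pyinfra.

```python
from typing import Any, Callable, Dict, Tuple


class FactCache:
    """Per-host cache of gathered facts (hypothetical, not pyinfra's API)."""

    def __init__(self, gather: Callable[[str, str], Any]) -> None:
        self._gather = gather  # function that actually queries the host
        self._cache: Dict[Tuple[str, str], Any] = {}
        self.gather_calls = 0  # counts real (uncached) gathers

    def get(self, host: str, fact: str) -> Any:
        key = (host, fact)
        if key not in self._cache:
            self.gather_calls += 1
            self._cache[key] = self._gather(host, fact)
        return self._cache[key]

    def invalidate(self, host: str) -> None:
        """Drop every cached fact for a host, e.g. after an arbitrary shell command."""
        for key in [k for k in self._cache if k[0] == host]:
            del self._cache[key]


# Simulated remote state standing in for a real host.
state = {"web-01": {"/srv/app": False}}


def gather_directory_exists(host: str, path: str) -> bool:
    return state[host].get(path, False)


cache = FactCache(gather_directory_exists)
first = cache.get("web-01", "/srv/app")    # gathered from the "host"
second = cache.get("web-01", "/srv/app")   # served from the cache
state["web-01"]["/srv/app"] = True         # an operation changes the host
cache.invalidate("web-01")                 # the operation invalidates the cache
third = cache.get("web-01", "/srv/app")    # gathered again, sees the new state
```

Invalidating per host is the bluntest possible policy; a finer-grained version could invalidate only the facts an operation is known to affect.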

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
Fizzadar commented, May 7, 2022

Thank you for writing this up @julienlavergne - this has recently been on my mind also. I’m going to document the context of why this is the case below (this will be long I think 😃), and give my thoughts.

Firstly, let’s split the problem into two distinct parts:

  • Ordering operations correctly
  • Gathering facts on operations (pre-execution vs. execution)

Historical Context

Prior to v2, pyinfra relied on line-number ordering and pre-execution fact gathering to achieve its high performance. The reason for this is that operations were generated on hosts sequentially, rather than in parallel. As facts were required, they were gathered in parallel on all hosts (whether or not they needed the specific fact). For example:

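# inventory.py
web_servers = ["web-01", "web-02"]
db_servers = ["db-01", "db-02"]

# deploy.py
from pyinfra import host, inventory
from pyinfra.operations import files

# Web servers and database servers take different code paths.
if host in inventory.get_group("web_servers"):
    files.file(path="web-file")
else:
    files.file(path="db-file")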

To generate the commands to execute the following would happen (v0.x, v1.x):

  1. Connect to all 4 servers in parallel
  2. For each server sequentially:
     a. execute deploy.py with host set to that server
     b. as facts are required (files.File), load them on all 4 servers in parallel
  3. Now we have our operation order & commands to run
  4. Execute all the operations in parallel across the hosts

Because facts were loaded in parallel (2.b), each iteration of step 2 got quicker and quicker, as most or all facts were already cached for the current host. This is why facts were (and still are) gathered before execution.
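
As a rough simulation of that pre-execution pass (not pyinfra's actual code), the sketch below runs the deploy sequentially per host while loading each newly required fact on all hosts in parallel. The fact gathering is faked, and names such as load_fact_on_all_hosts and gather_rounds are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

HOSTS = ["web-01", "web-02", "db-01", "db-02"]
WEB_SERVERS = {"web-01", "web-02"}

fact_cache: Dict[Tuple[str, str], bool] = {}
gather_rounds: List[str] = []  # one entry per parallel gathering round


def load_fact_on_all_hosts(fact: str) -> None:
    """Gather one fact on every host in parallel and cache the results."""

    def gather(host: str) -> Tuple[Tuple[str, str], bool]:
        return (host, fact), False  # simulated result: the file does not exist yet

    with ThreadPoolExecutor(max_workers=len(HOSTS)) as pool:
        fact_cache.update(pool.map(gather, HOSTS))
    gather_rounds.append(fact)


def get_fact(host: str, fact: str) -> bool:
    if (host, fact) not in fact_cache:
        load_fact_on_all_hosts(fact)  # cache miss: load for all hosts at once
    return fact_cache[(host, fact)]


def deploy(host: str) -> List[str]:
    """Simplified deploy.py: return the commands needed for one host."""
    path = "web-file" if host in WEB_SERVERS else "db-file"
    exists = get_fact(host, f"files.File:{path}")
    return [] if exists else [f"touch {path}"]


# Step 2: run the deploy for each host sequentially to build the command list.
commands = {host: deploy(host) for host in HOSTS}
```

Only web-01 and db-01 trigger a gathering round here; web-02 and db-02 find their facts already cached, which is the speed-up described above.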

The above example also highlights why operation order is generated from line numbers - because the same code (deploy.py) is executed for multiple hosts, the order in which operations are called is inconsistent. In the above example the user expects execution flow to behave like:

  1. Execute files.file(path=web-file) on web-01 & web-02
  2. Wait for completion of 1
  3. Execute files.file(path=db-file) on db-01 & db-02

By taking operation order as they are called this would not be possible. Note: this does not affect deploys against a single host target, where operation call order would work.
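
To make the difference concrete, here is a small sketch that compares ordering by first call with ordering by line number. The line numbers and the collect helper are illustrative assumptions, not how pyinfra records operations internally.

```python
from typing import Dict, List, Tuple

WEB_SERVERS = {"web-01", "web-02"}

Op = Tuple[int, str]  # (line number in deploy.py, operation description)


def deploy(host: str) -> List[Op]:
    """Return the operations one host would call, tagged with illustrative line numbers."""
    if host in WEB_SERVERS:
        return [(2, "files.file(path=web-file)")]
    return [(4, "files.file(path=db-file)")]


def collect(hosts: List[str]) -> Dict[Op, List[str]]:
    """Run the deploy once per host and record which hosts hit each operation."""
    stages: Dict[Op, List[str]] = {}
    for host in hosts:
        for op in deploy(host):
            stages.setdefault(op, []).append(host)
    return stages


hosts = ["db-01", "web-01", "web-02", "db-02"]
stages = collect(hosts)

# Call order: whichever operation was reached first wins, so it depends on host order.
by_call = [(op[1], targets) for op, targets in stages.items()]
# Line order: sort by line number, independent of which host was processed first.
by_line = [(op[1], stages[op]) for op in sorted(stages)]
```

With db-01 listed first, call order would run the db-file operation before the web-file one, while line order keeps the sequence written in deploy.py.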

What can we do now?

Back to v2 and your points above, I’ll split my thoughts into the two problems above:

Operation ordering

  • Since there is no need to come up with a list of facts to gather beforehand, the whole problem of ordering operations goes away. In particular, the code that browses the stack to come up with the operation order becomes unnecessary, and this code does not output the correct order in several cases.
  • There is no need for the preserve_loop_order magic anymore. It was anyway very counter-intuitive to have operations in loops not executed in the expected order.

Unfortunately I don’t think we can avoid this without breaking operation execution flow, particularly where there are multiple code paths for different hosts involved in a deploy. The line/stack ordering enforces “correct” ordering - except loops and context processors. The general assumption being that deploy files are generally “simplified Python” consisting of operation calls, conditional statements and functions. I’m not a fan of this gotcha and would be keen to investigate alternatives!

While I don’t see a way to remove the line ordering mechanism, I would like to have it automatically handle loops and context processors if possible. In v0.x pyinfra would modify the AST of deploy code before execution to achieve ordering without line numbers, and that may be a workable solution. Alternatively, it might be possible to modify the loop detection code to automatically re-order them as expected.
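
The loop gotcha can be reproduced in a few lines. In this sketch each recorded call carries a line number and a loop iteration index; both the tuple layout and the loop-aware sort key are assumptions for illustration, not pyinfra's implementation.

```python
from typing import List, Tuple

# Each recorded call: (line number, loop iteration, description).
calls: List[Tuple[int, int, str]] = []
for iteration, name in enumerate(["a", "b"]):
    calls.append((10, iteration, f"create user {name}"))
    calls.append((11, iteration, f"add ssh key for {name}"))

# Pure line-number ordering groups by line, interleaving the loop iterations.
line_order = [c[2] for c in sorted(calls, key=lambda c: c[0])]

# A loop-aware key sorts by iteration first, restoring the order the loop ran in.
loop_aware = [c[2] for c in sorted(calls, key=lambda c: (c[1], c[0]))]
```

Line ordering yields both "create user" operations before any "add ssh key" operation, which is the behavior users find surprising; keying on the iteration as well keeps each user's operations together.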

Fact gathering

  • The whole concept of nested operations is moot or becomes very limited. It seems the main use case was to run an operation, get the output and perform more operations based on the output. But with facts gathered along the way, the output of the command is available immediately and conditional logic can be written directly within the main python script (instead of the callback).

Because of the operation ordering issue, it’s still not possible to provide output from an operation immediately. The deploy code must be run once before any operations are actually executed to generate the order, which unfortunately makes it impossible to have the output included.

  • Most usage of assume_present arguments becomes unnecessary since facts will reflect the correct state of the machine right before an operation is performed.

I would absolutely love to remove this; it’s a real pain and a massive gotcha. v2 makes it entirely possible by having operations (re)collect facts at execution time. The only drawback is that the list of changes shown pre-execution may not be correct, i.e. if you do a dry-run deploy first, you expect the number of commands proposed to match those executed, and collecting facts at execution may break this. One option could be to display “up to X” commands per operation, because we can make reasonable assumptions that certain facts will change (files) and others will not (system OS).
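
One way the “up to X” display might look is sketched below. The PlannedOp structure, the VOLATILE_FACTS set and the choice of which facts count as volatile are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

# Facts assumed likely to change during a deploy (illustrative choice).
VOLATILE_FACTS = {"files.File", "files.Directory"}


@dataclass
class PlannedOp:
    name: str
    facts_used: List[str]
    commands_now: int  # commands proposed from pre-execution facts
    commands_max: int  # worst case if the facts change before execution


def describe(op: PlannedOp) -> str:
    """Render a dry-run line, hedging the count when volatile facts are involved."""
    uses_volatile = any(fact in VOLATILE_FACTS for fact in op.facts_used)
    if uses_volatile and op.commands_max > op.commands_now:
        return f"{op.name}: up to {op.commands_max} command(s)"
    return f"{op.name}: {op.commands_now} command(s)"


file_op = describe(PlannedOp("Create file", ["files.File"], 0, 2))
os_op = describe(PlannedOp("Set hostname", ["server.Os"], 1, 1))
```

Operations that depend only on stable facts keep an exact count, so the dry run stays precise wherever it can be.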

Thoughts

Collecting some thoughts below on the more general philosophy of pyinfra and how it works.

I do think the “dry run” pyinfra offers is a powerful tool that has a lot of unused potential. On a basic level pyinfra could support terraform style approval steps. Even more interesting would be the idea of creating a diff file that can then be moved somewhere else for execution - pyinfra needn’t even be the tool doing the execution.
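
As a minimal sketch of the diff-file idea, the planned commands could be serialized to JSON after the dry run and loaded again wherever they are executed. The file layout, the version field and the function names are hypothetical.

```python
import json
from typing import Dict, List


def write_plan(commands_by_host: Dict[str, List[str]]) -> str:
    """Serialize the dry-run result so it can be reviewed or shipped elsewhere."""
    plan = {"version": 1, "hosts": commands_by_host}
    return json.dumps(plan, indent=2, sort_keys=True)


def read_plan(text: str) -> Dict[str, List[str]]:
    """Load a plan produced by write_plan, rejecting unknown formats."""
    plan = json.loads(text)
    if plan.get("version") != 1:
        raise ValueError("unsupported plan version")
    return plan["hosts"]


plan_text = write_plan({"web-01": ["touch web-file"], "db-01": ["touch db-file"]})
restored = read_plan(plan_text)
```

Such a plan is only valid as long as the hosts have not changed since the facts were gathered, which is the same trade-off an approval step has.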

The whole two-stage deploy mechanism has consistently provided complexity over the last 7(!) years, but has also enabled writing almost-normal Python code to generate operations that execute in a similar way to tools like Ansible. I’ve yet to encounter something that wasn’t possible (but have seen things not possible in other tools). Examples & documentation would help a lot here I think.

Today pyinfra seems to be a hybrid of an Ansible/SaltStack-like, mostly state-based deployment tool and a Fabric/Parallel-SSH-style command execution tool. This is both an advantage, in terms of high flexibility, and a disadvantage, because it comes with some gotchas that make it “almost like Python” at times.


I hope this provides some context, please let me know if anything doesn’t make sense and would love to hear thoughts from any pyinfra users on the above. Ultimately I think any changes to these systems are on the table assuming enough support and technical possibility 😃

0 reactions
ubipo commented, May 21, 2022

I think at least the following would be difficult with single-run execution:

  • the ‘same’[1] operation on each host:
    • cannot be displayed as a group in the summary output (a)
    • cannot be interactively approved/disapproved as a group (b)
  • synchronization of inter-host dependent operations is more difficult (c)
  • generating a diff file for later execution (d)

The ‘same’[1] operation cannot be grouped in the output (a)

We could use some command line ANSI escape code magic to update previously printed operations for hosts that have now also hit them. This might look something like:

--> Starting operation: Apt/Packages (packages=['vim'])
    [host-1] No changes
    [host-2] *executing* <insert loading animation here>

--> Starting operation: Apt/Packages (packages=['vim'], present=False)
    [host-1] *executing* <insert loading animation here>
    [host-2] *waiting for execution* <insert loading animation here>

…2 seconds later…

--> Starting operation: Apt/Packages (packages=['vim'])
    [host-1] No changes
    [host-2] Success  <-- this line changed even though it was already printed

--> Starting operation: Apt/Packages (packages=['vim'], present=False)
    [host-1] Success  <-- same here
    [host-2] Success  <-- and here
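
A minimal sketch of that redraw trick, using the standard ANSI sequences for "cursor up" and "erase line". The render function is a made-up helper, and it assumes the previously printed block is still visible in the terminal.

```python
import sys
from typing import List

CURSOR_UP = "\x1b[{n}A"  # move the cursor up n lines
CLEAR_LINE = "\x1b[2K"   # erase the whole current line


def render(lines: List[str], previous_count: int) -> str:
    """Return the text that redraws a status block over the previous one."""
    out = CURSOR_UP.format(n=previous_count) if previous_count else ""
    for line in lines:
        out += f"\r{CLEAR_LINE}{line}\n"
    return out


first = [
    "--> Starting operation: Apt/Packages (packages=['vim'])",
    "    [host-1] No changes",
    "    [host-2] *executing*",
]
second = first[:2] + ["    [host-2] Success"]

frame_1 = render(first, previous_count=0)
frame_2 = render(second, previous_count=len(first))  # overwrites the first frame
sys.stdout.write(frame_1)
sys.stdout.write(frame_2)
```

If a later frame has fewer lines than the one it replaces, the leftover lines would need clearing as well; this sketch does not handle that case.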

The ‘same’[1] operation cannot be interactively approved/disapproved as a group (b)

While not ideal, this could be solved with a synchronization key:

burn_server_op_name = 'Set server on fire'
if host in inventory.get_group("flammable_nodes"):
  pyinfra.operations.server.burn_server(name=burn_server_op_name, approval_needed=True)
else:
  pyinfra.operations.skip(name=burn_server_op_name)

--> Starting operation: Burn server (key='Set server on fire')
    Do you want to continue with 'Set server on fire' with the following changes:
    [host-1] Success: <some information about the planned execution>
    [host-2] Success: <some information about the planned execution>
    [host-3] Skipped
    Continue? [y/N]

The key could just be name, or it could be something separate like key or sync_key.
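
Grouping by such a key could work roughly as follows. The tuple layout, group_by_key and approval_prompt are hypothetical helpers that only mimic the prompt shown above.

```python
from typing import Dict, List, Optional, Tuple

# (host, sync key, planned change, or None when the host skips the operation)
planned: List[Tuple[str, str, Optional[str]]] = [
    ("host-1", "Set server on fire", "burn server"),
    ("host-2", "Set server on fire", "burn server"),
    ("host-3", "Set server on fire", None),
]


def group_by_key(
    items: List[Tuple[str, str, Optional[str]]]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Collect per-host plans under their shared synchronization key."""
    groups: Dict[str, Dict[str, Optional[str]]] = {}
    for host, key, change in items:
        groups.setdefault(key, {})[host] = change
    return groups


def approval_prompt(key: str, per_host: Dict[str, Optional[str]]) -> str:
    """Build one approval prompt covering every host that shares the key."""
    lines = [f"Do you want to continue with '{key}' with the following changes:"]
    for host, change in per_host.items():
        lines.append(f"[{host}] {change if change is not None else 'Skipped'}")
    lines.append("Continue? [y/N]")
    return "\n".join(lines)


groups = group_by_key(planned)
prompt = approval_prompt("Set server on fire", groups["Set server on fire"])
```

The prompt can only be shown once every host has reached the keyed operation or its skip, so the key also acts as a synchronization point.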

Inter-host dependent operations (c)

As @julienlavergne pointed out this could be solved using a synchronization API:

if host in inventory.get_group("control_nodes"):
  ...  # control stuff (1)
else:
  ...  # data stuff

# proposed synchronization API (does not exist in pyinfra)
pyinfra.wait(key="Wait for control-1")

if host not in inventory.get_group("control_nodes"):
  ...  # data stuff that depends on (1) ("start ES on the three data nodes")

asyncio.Event doesn’t need a key because it’s used within a single execution context. The different host runs of pyinfra, however, do not share the same context (I think?).

I think using a key is fine, but we could move per-host code into the same context like so:

control_done_event = asyncio.Event()

async def deploy(host):
  if host in inventory.get_group("control_nodes"):
    # control stuff (1)
    control_done_event.set()
  else:
    # data stuff
    await control_done_event.wait()
    # data stuff that depends on (1) ("start ES on the three data nodes")

Where this file is executed once and then deploy is called for each host. This per-host calling could even be delegated to the deploy file itself:

await asyncio.gather(*(deploy(host) for host in hosts_to_execute_magic_variable))

Although I don’t know how pyinfra would be able to figure out which host an operation is being executed for.
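
One possible answer, sketched with the standard library only: each host's deploy runs in its own asyncio task, and a contextvars.ContextVar set inside that task tells an operation which host it belongs to. The operation function, run_for wrapper and host names are illustrative assumptions, not pyinfra APIs.

```python
import asyncio
import contextvars
from typing import List

current_host = contextvars.ContextVar("current_host")
CONTROL_NODES = {"control-1"}
log: List[str] = []


def operation(name: str) -> None:
    """Stand-in for an operation call; it looks up the host from the task context."""
    log.append(f"{current_host.get()}: {name}")


async def deploy(control_done: asyncio.Event) -> None:
    if current_host.get() in CONTROL_NODES:
        operation("control stuff (1)")
        control_done.set()
    else:
        operation("data stuff")
        await control_done.wait()  # block until the control node is done
        operation("data stuff that depends on (1)")


async def run_for(host: str, control_done: asyncio.Event) -> None:
    current_host.set(host)  # only visible inside this host's task
    await deploy(control_done)


async def main(hosts: List[str]) -> None:
    control_done = asyncio.Event()
    # gather wraps each coroutine in a task with its own copy of the context
    await asyncio.gather(*(run_for(host, control_done) for host in hosts))


asyncio.run(main(["data-1", "control-1", "data-2"]))
```

Because every task gets its own copy of the context, operations never need the host passed explicitly, and the shared Event replaces the string key.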

Generating a diff file for later execution (d)

For this one I cannot really think of a solution.


[1] I guess operations called with different arguments are considered the same.

P.S. Amazing project, thanks for all the hard work! ❤️
