What are the advantages of gathering facts beforehand?
Is your feature request related to a problem? Please describe
I have been modifying a local version of pyinfra in order to solve different issues (some of them posted as GitHub issues here), and I came to the conclusion that there is not much advantage in gathering facts before operations. In fact, most of the limitations and weird behaviors I see would be solved by gathering facts along the way.
I would be interested in discussing the pros and cons, and in contributing to changing the way pyinfra works if necessary.
Describe the solution you’d like
Here is a short list of the advantages of gathering facts right before the operations that require them:
- Since there is no need to come up with a list of facts to gather beforehand, the whole problem of ordering operations goes away. In particular, the code that walks the stack to work out operation order becomes unnecessary, and this code does not output the correct order in several cases.
- There is no need for the `preserve_loop_order` magic anymore. It was in any case very counter-intuitive to have operations in loops not executed in the expected order.
- Most usage of the `assume_present` argument becomes unnecessary, since facts will reflect the correct state of the machine right before an operation is performed.
- The execution flow is easier to understand; for example, simple things like creating a directory and then checking that the directory exists would just work.
- The whole concept of nested operations becomes moot, or at least very limited. The main use case seems to be running an operation, getting its output and performing more operations based on that output. But with facts gathered along the way, the output of the command is available immediately, and conditional logic can be written directly in the main Python script (instead of in a callback).
- There is no need to support dynamic facts, since facts are dynamic by default.
- We can still keep a cache to make sure facts are not gathered unnecessarily and, just like Ansible, provide ways to invalidate the cache for arbitrary operations like `server.shell`.
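To illustrate that last point, here is a rough sketch of what a per-host fact cache with explicit invalidation could look like. All names here (`FactCache`, `get_fact`, `invalidate`) are hypothetical, not pyinfra API:

```python
# Hypothetical sketch: facts are gathered lazily on first use, served from a
# cache afterwards, and invalidated after operations whose effects we cannot
# predict (e.g. server.shell). Not pyinfra code.

class FactCache:
    def __init__(self, gather):
        self._gather = gather  # callable(fact_name) -> value, runs on the host
        self._cache = {}

    def get_fact(self, fact_name):
        # Gather on first use, then serve from the cache until invalidated
        if fact_name not in self._cache:
            self._cache[fact_name] = self._gather(fact_name)
        return self._cache[fact_name]

    def invalidate(self, fact_name=None):
        # Arbitrary operations could call this, since we cannot know
        # what state they changed
        if fact_name is None:
            self._cache.clear()
        else:
            self._cache.pop(fact_name, None)


calls = []

def fake_gather(name):
    # stand-in for a real fact-gathering command on the host
    calls.append(name)
    return f"value-of-{name}"

cache = FactCache(fake_gather)
cache.get_fact("os")   # gathered on the host
cache.get_fact("os")   # served from cache, no second gather
cache.invalidate()     # e.g. after a server.shell operation
cache.get_fact("os")   # gathered again
```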
Issue Analytics
- State:
- Created a year ago
- Comments:7 (3 by maintainers)
Top GitHub Comments
Thank you for writing this up @julienlavergne - this has recently been on my mind also. I’m going to document the context of why this is the case below (this will be long I think 😃), and give my thoughts.
Firstly, let’s split the problem into two distinct parts: operation ordering and fact gathering.
Historical Context
Prior to v2, pyinfra relied on line-number ordering and pre-execution fact gathering to achieve its high performance. The reason for this is that operations were generated for hosts sequentially, rather than in parallel. As facts were required, they were gathered in parallel on all hosts (whether or not they needed the specific fact). For example:
To generate the commands to execute, the following would happen (v0.x, v1.x), for each of the 4 servers sequentially:

a. execute `deploy.py` with `host` set to that server
b. as facts are required (`files.File`), load them on all 4 servers in parallel

Because facts were loaded in parallel (step b), each per-host iteration got quicker and quicker as most/all facts were pre-cached for the current host. This is why facts were (are) gathered before execution.
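That per-host loop can be sketched as a toy model (not pyinfra's actual implementation; the hosts and the SSH round-trip are faked here):

```python
from concurrent.futures import ThreadPoolExecutor

HOSTS = ["web-01", "web-02", "db-01", "db-02"]
fact_cache = {}     # (host, fact) -> value
gather_rounds = []  # one entry per network round-trip for a fact

def gather_fact_on(host, fact):
    # stand-in for an SSH command executed on the host
    return f"{fact}@{host}"

def get_fact(current_host, fact):
    # first request triggers a parallel gather on *all* hosts...
    if (current_host, fact) not in fact_cache:
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda h: (h, gather_fact_on(h, fact)), HOSTS))
        for h, value in results:
            fact_cache[(h, fact)] = value
        gather_rounds.append(fact)
    # ...so later hosts hit the cache and pay no extra round-trip
    return fact_cache[(current_host, fact)]

# Simulate generating operations host by host, as in the example above:
for host in HOSTS:
    get_fact(host, "files.File:/etc/motd")

# Only one gather round happened for all four hosts
```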
The above example also highlights why operation order is generated from line numbers: because the same code (`deploy.py`) is executed for multiple hosts, the order in which operations are called is inconsistent. In the above example the user expects execution flow to behave like:

1. `files.file(path=web-file)` on web-01 & web-02
2. `files.file(path=db-file)` on db-01 & db-02

By taking operations in the order they are called this would not be possible. Note: this does not affect deploys against a single host target, where operation call order would work.
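A toy illustration of this (not pyinfra internals): if each operation call records its source line number, sorting by line restores a consistent order across hosts even though the raw call order interleaves them:

```python
import inspect

ops = []  # (line_number, op_name, host)

def op(name, host):
    # record the caller's line number, mimicking pyinfra's line/stack ordering
    line = inspect.stack()[1].lineno
    ops.append((line, name, host))

def deploy(host):
    # same deploy code, different code paths per host role
    if host.startswith("web"):
        op("files.file(web-file)", host)
    else:
        op("files.file(db-file)", host)

for h in ["web-01", "db-01", "web-02", "db-02"]:
    deploy(h)

call_order = [(name, host) for _, name, host in ops]
line_order = [(name, host) for _, name, host in sorted(ops, key=lambda t: t[0])]
# call_order interleaves web/db operations host by host; line_order groups
# all web-file calls before all db-file calls, matching the source order
```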
What can we do now?
Back to v2 and your points above; I’ll split my thoughts across these two problems:
Operation ordering
Unfortunately I don’t think we can avoid this without breaking operation execution flow, particularly where there are multiple code paths for different hosts involved in a deploy. The line/stack ordering enforces “correct” ordering - except in loops and context processors. The general assumption is that deploy files are “simplified Python” consisting of operation calls, conditional statements and functions. I’m not a fan of this gotcha and would be keen to investigate alternatives!
While I don’t see a way to remove the line ordering mechanism, I would like to have it automatically handle loops and context processors if possible. In v0.x pyinfra would modify the `ast` of deploy code before execution to achieve ordering without line numbers, and that may be a workable solution. Alternatively it might be possible to modify the loop detection code to automatically re-order them as expected.

Fact gathering
Because of the operation ordering issue, it’s still not possible to provide output from an operation immediately. The deploy code must be run once before any operations are actually executed to generate the order, which unfortunately makes it impossible to have the output included.
I would absolutely love to remove this, it’s a real pain and a massive gotcha. v2 makes it entirely possible to do by having operations (re)collect facts at execution time. The only drawback is that the list of changes shown pre-execution may not be correct; i.e. if you do a dry-run deploy first, you expect the number of commands proposed to match those executed, and collecting facts at execution may break this. One option could be to display “up to X” commands per operation, because we can make reasonable assumptions that certain facts will change (files) and others will not (system OS).
Thoughts
Collecting some thoughts below on the more general philosophy of pyinfra and how it works.
I do think the “dry run” pyinfra offers is a powerful tool that has a lot of unused potential. On a basic level pyinfra could support terraform style approval steps. Even more interesting would be the idea of creating a diff file that can then be moved somewhere else for execution - pyinfra needn’t even be the tool doing the execution.
The whole two-stage deploy mechanism has consistently been a source of complexity over the last 7(!) years, but it has also enabled writing almost-normal Python code to generate operations that execute in a similar way to tools like Ansible. I’ve yet to encounter something that wasn’t possible (but have seen things that aren’t possible in other tools). Examples & documentation would help a lot here I think.
Today pyinfra seems to be a hybrid of an Ansible/SaltStack-like, mostly-state-based deployment tool and a Fabric/Parallel-SSH-style command execution tool. This is definitely an advantage in terms of high flexibility, but also a disadvantage because it comes with some gotchas that make it “almost like Python” at times.
I hope this provides some context, please let me know if anything doesn’t make sense and would love to hear thoughts from any pyinfra users on the above. Ultimately I think any changes to these systems are on the table assuming enough support and technical possibility 😃
I think at least the following would be difficult with single-run execution:
The ‘same’[1] operation cannot be grouped in the output (a)
We could use some command line ANSI escape code magic to update previously printed operations for hosts that have now also hit them. This might look something like:
…2 seconds later…
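A minimal sketch of the cursor-movement mechanics (illustrative only; writing to a `StringIO` here so the escape codes stay visible rather than being interpreted by a terminal):

```python
import io

# ANSI trick: move the cursor up and rewrite a previously printed operation
# line as more hosts reach the same operation.
CURSOR_UP = "\x1b[1A"   # move the cursor up one line
CLEAR_LINE = "\x1b[2K"  # erase the whole current line

def render(stream, lines_printed, text):
    # rewind over what we printed before, then reprint the updated line
    stream.write((CURSOR_UP + CLEAR_LINE) * lines_printed)
    stream.write(text + "\n")

out = io.StringIO()
out.write("[web-01] files.file\n")             # first host hits the operation
render(out, 1, "[web-01, web-02] files.file")  # ...2 seconds later, second host

# On a real terminal this would now show a single merged line for both hosts
```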
The ‘same’[1] operation cannot be interactively approved/disapproved as a group (b)
While not ideal, this could be solved with a synchronization key:
The key could just be `name`, or it could be something separate like `key` or `sync_key`.
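A sketch of what such `sync_key` semantics could mean, modelled as a named barrier: hosts declaring the same key wait for each other before the operation runs. Note that `sync_key` is a hypothetical argument, not part of pyinfra:

```python
import threading

HOSTS = ["web-01", "web-02", "db-01"]
barriers = {}
barriers_lock = threading.Lock()
order = []

def operation(host, name, sync_key=None, parties=1):
    if sync_key is not None:
        # all hosts declaring this key arrive here before any proceeds
        with barriers_lock:
            barrier = barriers.setdefault(sync_key, threading.Barrier(parties))
        barrier.wait()
    order.append((name, host))

# each host runs its deploy concurrently, synchronising on the shared key
threads = [
    threading.Thread(
        target=operation,
        args=(h, "restart nginx"),
        kwargs={"sync_key": "restart", "parties": len(HOSTS)},
    )
    for h in HOSTS
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```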
Inter-host dependent operations (c)
As @julienlavergne pointed out this could be solved using a synchronization API:
asyncio.Event doesn’t need a `key` because it’s used in the same execution context. The different host runs of pyinfra however do not share the same context (I think?). I think using a key is fine, but we could move per-host code into the same context like so:
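For example, a sketch assuming all hosts run as coroutines in one event loop (hypothetical; this is not how pyinfra executes today):

```python
import asyncio

# When every host's deploy runs inside the same process and event loop, a
# plain asyncio.Event is enough to make db hosts wait on web hosts - no key
# needed, because they share the same execution context.
order = []

async def deploy(host, web_done):
    if host.startswith("web"):
        order.append((host, "deploy web app"))
        web_done.set()         # signal that the web step is complete
    else:
        await web_done.wait()  # db hosts block until the web step ran
        order.append((host, "migrate database"))

async def main():
    web_done = asyncio.Event()
    await asyncio.gather(
        deploy("db-01", web_done),
        deploy("web-01", web_done),
    )

asyncio.run(main())
```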
Where this file is executed once and then `deploy` is called for each host. This per-host calling could even be delegated to the deploy file itself, although I don’t know how pyinfra would be able to figure out which host an operation is being executed for.
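One hypothetical shape for this, with the current host passed explicitly as a function argument so there is no implicit global host state (all names here are illustrative, not pyinfra API):

```python
# The deploy file is executed once and drives the per-host loop itself;
# operations receive the host explicitly instead of pyinfra inferring it.

HOSTS = ["web-01", "web-02"]
executed = []

def files_file(host, path):
    # stand-in for a pyinfra operation bound to an explicit host
    executed.append((host, f"files.file({path})"))

def deploy(host):
    files_file(host, path="/etc/motd")

for host in HOSTS:
    deploy(host)
```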
Generating a diff file for later execution (d)
For this one I cannot really think of a solution.
[1] I guess operations called with different arguments are considered the same.
P.S. Amazing project, thanks for all the hard work! ❤️