Serialization of RPC between main process and the compiler
Motivation
Currently, we have:
- Main process and compiler process communicating over Unix socket
- The protocol uses pickle for serialization
- Compiler process connects to Postgres on its own
This isn’t going to work with the main process in Rust and with cloud support:
- While it’s possible to work with pickle in Rust, it isn’t a good idea
- We want a many-to-many relationship between the processes in the cloud
- Having the compiler hold onto a Postgres connection is inefficient (connections are costly in Postgres)
The Proposed Protocol
Framing
@tailhook is used to using WebSocket framing for everything nowadays (see the sketch after this list):
- It’s standard, ubiquitously supported
- Sufficiently extensible
- Exposing (Prometheus?) metrics over HTTP on the same port is a plus
- It has a standard connection liveness protocol (Ping/Pong packets)
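To make the liveness point concrete, here is a minimal sketch using the third-party Python `websockets` package; the endpoint URL is made up, and `ping_interval`/`ping_timeout` are the package’s real options for its built-in Ping/Pong handling.

```python
import asyncio
import websockets  # pip install websockets

async def main():
    # Hypothetical compiler endpoint; the point is that framing,
    # liveness, and extensibility all come with the protocol.
    async with websockets.connect(
        "ws://localhost:5656/compiler",  # made-up URL
        ping_interval=5,  # send a Ping frame every 5 seconds
        ping_timeout=2,   # treat the peer as dead if no Pong within 2s
    ) as ws:
        await ws.send(b"\xa1\x00\x01")  # binary frames are native
        reply = await ws.recv()         # framing is handled for us
        print(reply)

asyncio.run(main())
```

A hung compiler process is detected by the missing Pong without any application-level timeout code, which is exactly the advantage over plain HTTP noted above.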
Alternative Framing
- We could use HTTP for most things, but being able to determine connection liveness reliably is a very big advantage. And we’ll probably need some non-RPC messages, see below.
- Rolling our own framing would require providing ping/pong and future extensions ourselves. While that isn’t particularly complex, I don’t think it’s justified.
- Use Postgres framing. The downsides are:
  - It’s just framing; we still need to specify what the actual request/response, ping/pong, and other packets look like
  - We still need to decide how to serialize our own data
Serialization
The proposal is CBOR (a small sketch follows this list):
- It’s standard
- There is the Concise Data Definition Language (CDDL) to describe CBOR structures
- It’s binary, fast and sufficiently extensible
- It’s relatively compact. While preserving field names, it’s still comparable to Protobuf after gzipping (there is a standard compression extension in the WebSocket protocol for when we need one). We can also replace field names with 1-byte numbers, or use arrays, if we have lots of small objects.
- There are fast encoders for every language including Python and Rust
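A small sketch of the compactness trade-off with the third-party `cbor2` package (the payload shape is made up): string field names versus 1-byte integer keys.

```python
import cbor2  # pip install cbor2; any CBOR codec would do

# Hypothetical RPC payload with string field names preserved.
named = {"query": "select User { name }", "schema_hash": b"\x01" * 32}

# The same payload with 1-byte integer keys instead of field names.
numbered = {0: "select User { name }", 1: b"\x01" * 32}

a, b = cbor2.dumps(named), cbor2.dumps(numbered)
print(len(a), len(b))  # the integer-keyed form saves the key bytes

# Unlike JSON, binary data round-trips without base64 or escaping.
assert cbor2.loads(a) == named
```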
Alternative Serialization
Lots of them:
- Protobuf is a non-starter.
- Cap’n Proto is unnecessarily complex and bad at backward compatibility.
- JSON is bad at binary data and not compact enough.
- Postgres protocol: there are two different serializations there, one for message structure and one for the data. The latter requires describing types upfront, is suited only for tabular data, and is not very compact. The former is not self-descriptive, which makes debugging harder.
We could go on and on; we just have to pick one.
Semantics
To make the compiler completely stateless, we move all communication with Postgres into the main process.
Compile To SQL
There are perhaps two API calls:
compile_to_sql(query_text, schema_hash) -> (compiled_query, Option<changes_to_schema>)
compile_to_sql_given_schema(query, schema) -> (compiled_query, Option<changes_to_schema>)
Note:
- It’s stateless. If there is no schema with that hash, the compiler returns an error and the main process calls the second method (see the sketch after these notes).
- The slow path here is when the actual schema changes, which is fine.
- ❓ The DDLs probably don’t need any dedicated processes with these semantics
- ❓ Schemas might not need to be tied to a user or a database. So multiple tutorial users can share the same schema if the hash matches.
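A sketch of the resulting control flow on the main-process side; the client class and exception below are hypothetical stand-ins for whatever the RPC layer ends up looking like.

```python
class SchemaNotFound(Exception):
    """The compiler has no schema cached under the given hash."""

def compile_query(client, query_text, schema, schema_hash):
    try:
        # Fast path: the compiler already holds this schema.
        return client.compile_to_sql(query_text, schema_hash)
    except SchemaNotFound:
        # Slow path: ship the full schema; the compiler caches it
        # under its hash, so subsequent calls take the fast path.
        return client.compile_to_sql_given_schema(query_text, schema)
```

Only the first query after an actual schema change pays for transferring the schema, which matches the “slow path is fine” note above.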
There are a few performance optimizations that can be done on top (the first is sketched after this list):
- The main process keeping track of which schemas are cached (i.e. the compiler publishes hashes as it receives schemas)
- Forcing preload of specific schemas when connecting to a compiler (this is probably only justified in the paid tier).
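A sketch of the first optimization, with hypothetical names: the main process records the hashes the compiler publishes and skips the doomed fast-path call entirely when it knows the schema is not cached.

```python
class CompilerPool:
    def __init__(self, client):
        self.client = client
        self.known_hashes = set()  # hashes the compiler has confirmed

    def on_schema_published(self, schema_hash):
        # Invoked when the compiler announces a newly cached schema,
        # e.g. via a non-RPC WebSocket message.
        self.known_hashes.add(schema_hash)

    def compile(self, query_text, schema, schema_hash):
        if schema_hash in self.known_hashes:
            return self.client.compile_to_sql(query_text, schema_hash)
        result = self.client.compile_to_sql_given_schema(query_text, schema)
        self.known_hashes.add(schema_hash)  # it is cached now
        return result
```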
The obvious question here is: Is it fast enough to transfer schema like this?
- Well, obviously we need to do it one way or another
- For paid tier, it will always be preloaded anyway (but asserting on a hash is still useful)
- For free tiers (i.e. try.edgedb.com) we should have a small limit on the schema size to avoid DoS (and hopefully we can reuse the same schema for most users)
Alternative Semantics
Well, there are a lot of different approaches and combinations of them. Please comment!
@tailhook: I’ve marked with “❓” my pure speculations that need more research.
Update 1: added notes about using the Postgres protocol.
Update 2: mentioned CDDL.
Top GitHub Comments
As long as it’s not worse than what we currently have, I’m fine with whatever approach. I’m NOT fine with having to hunt and kill orphaned subprocesses. In my workflow I restart the server very often, so it needs to start quickly and shut down cleanly. Docker and other virtualization/containerization is a non-starter for me due to long startup time.

Also, @tailhook, please remember that we are discussing an implementation of a single RPC interface with a handful of methods, which is not public, has a single client, and no requirement to be backward compatible.