Serialization of RPC between main process and the compiler
Motivation
Currently, we have:
- Main process and compiler process communicating over Unix socket
- The protocol uses pickle for serialization
- Compiler process connects to Postgres on its own
This isn’t going to work with the main process in Rust and with cloud support:
- While it’s possible to work with pickle in Rust, it isn’t a good idea
- We want a many-to-many relationship between the processes in the cloud
- Having the compiler hold onto a Postgres connection is inefficient (connections are costly in Postgres)
The Proposed Protocol
Framing
@tailhook is used to using WebSocket framing for everything nowadays (see the sketch after this list):
- It’s standard, ubiquitously supported
- Sufficiently extensible
- Exposing (Prometheus?) metrics over HTTP on the same port is a plus
- It has a standard connection liveness protocol (Ping/Pong packets)
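To make the liveness point concrete, here is a minimal sketch using the third-party Python `websockets` package; the endpoint URL is made up, and `ping_interval`/`ping_timeout` are the package’s real options for its built-in Ping/Pong handling.

```python
import asyncio
import websockets  # pip install websockets

async def main():
    # Hypothetical compiler endpoint; the point is that framing,
    # liveness, and extensibility all come with the protocol.
    async with websockets.connect(
        "ws://localhost:5656/compiler",  # made-up URL
        ping_interval=5,  # send a Ping frame every 5 seconds
        ping_timeout=2,   # treat the peer as dead if no Pong within 2s
    ) as ws:
        await ws.send(b"\xa1\x00\x01")  # binary frames are native
        reply = await ws.recv()         # framing is handled for us
        print(reply)

asyncio.run(main())
```

A hung compiler process is detected by the missing Pong without any application-level timeout code, which is exactly the advantage over plain HTTP noted above.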
Alternative Framing
- We could use HTTP for most things, but being able to determine connection liveness reliably is a very big advantage. And we’ll probably need some non-RPC messages, see below.
- Rolling our own framing would require providing ping/pong and future extensions ourselves. While that isn’t particularly complex, I don’t think it’s justified.
- Use Postgres framing. The downsides are:
  - It’s just framing; we still need to specify what the actual request/response, ping/pong, and other packets look like
  - We still need to decide how to serialize our own data
Serialization
The proposal is CBOR (a small sketch follows this list):
- It’s standard
- There is the Concise Data Definition Language (CDDL) to describe CBOR structures
- It’s binary, fast and sufficiently extensible
- It’s relatively compact. While preserving field names, it’s still comparable to Protobuf after gzipping (there is a standard compression extension in the WebSocket protocol for when we need one). We can also replace field names with 1-byte numbers, or use arrays, if we have lots of small objects.
- There are fast encoders for every language including Python and Rust
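A small sketch of the compactness trade-off with the third-party `cbor2` package (the payload shape is made up): string field names versus 1-byte integer keys.

```python
import cbor2  # pip install cbor2; any CBOR codec would do

# Hypothetical RPC payload with string field names preserved.
named = {"query": "select User { name }", "schema_hash": b"\x01" * 32}

# The same payload with 1-byte integer keys instead of field names.
numbered = {0: "select User { name }", 1: b"\x01" * 32}

a, b = cbor2.dumps(named), cbor2.dumps(numbered)
print(len(a), len(b))  # the integer-keyed form saves the key bytes

# Unlike JSON, binary data round-trips without base64 or escaping.
assert cbor2.loads(a) == named
```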
Alternative Serialization
Lots of them:
- Protobuf is a non-starter.
- Cap’n Proto is unnecessarily complex and bad at backward compatibility.
- JSON is bad at binary data and not compact enough.
- Postgres protocol: there are two different serializations there, one for message structure and one for the data. The latter requires describing types upfront, is suited only for tabular data, and is not very compact. The former is not self-descriptive, which makes debugging harder.
We could go on and on; we just have to pick one.
Semantics
To make the compiler completely stateless, we move all communication with Postgres into the main process.
Compile To SQL
There are perhaps two API calls:
compile_to_sql(query_text, schema_hash) -> (compiled_query, Option<changes_to_schema>)
compile_to_sql_given_schema(query, schema) -> (compiled_query, Option<changes_to_schema>)
Note:
- It’s stateless. If there is no schema with that hash, the compiler returns an error and the main process calls the second method (see the sketch after these notes).
- The slow path here is when the actual schema changes, which is fine.
- ❓ The DDLs probably don’t need any dedicated processes with these semantics
- ❓ Schemas might not need to be tied to a user or a database. So multiple tutorial users can share the same schema if the hash matches.
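A sketch of the resulting control flow on the main-process side; the client class and exception below are hypothetical stand-ins for whatever the RPC layer ends up looking like.

```python
class SchemaNotFound(Exception):
    """The compiler has no schema cached under the given hash."""

def compile_query(client, query_text, schema, schema_hash):
    try:
        # Fast path: the compiler already holds this schema.
        return client.compile_to_sql(query_text, schema_hash)
    except SchemaNotFound:
        # Slow path: ship the full schema; the compiler caches it
        # under its hash, so subsequent calls take the fast path.
        return client.compile_to_sql_given_schema(query_text, schema)
```

Only the first query after an actual schema change pays for transferring the schema, which matches the “slow path is fine” note above.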
There are a few performance optimizations that can be done on top (the first is sketched after this list):
- The main process keeping track of which schemas are cached (i.e. the compiler publishes hashes as it receives schemas)
- Forcing preload of specific schemas when connecting to a compiler (this is probably only justified in the paid tier).
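A sketch of the first optimization, with hypothetical names: the main process records the hashes the compiler publishes and skips the doomed fast-path call entirely when it knows the schema is not cached.

```python
class CompilerPool:
    def __init__(self, client):
        self.client = client
        self.known_hashes = set()  # hashes the compiler has confirmed

    def on_schema_published(self, schema_hash):
        # Invoked when the compiler announces a newly cached schema,
        # e.g. via a non-RPC WebSocket message.
        self.known_hashes.add(schema_hash)

    def compile(self, query_text, schema, schema_hash):
        if schema_hash in self.known_hashes:
            return self.client.compile_to_sql(query_text, schema_hash)
        result = self.client.compile_to_sql_given_schema(query_text, schema)
        self.known_hashes.add(schema_hash)  # it is cached now
        return result
```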
The obvious question here is: Is it fast enough to transfer schema like this?
- Well, obviously we need to do it one way or another
- For paid tier, it will always be preloaded anyway (but asserting on a hash is still useful)
- For free tiers (i.e. try.edgedb.com) we should have a small limit on the schema size to avoid DoS (and hopefully we can reuse the same schema for most users)
Alternative Semantics
Well, there are a lot of different approaches and combinations of them. Please comment!
@tailhook: I’ve marked with “❓” my pure speculations that need more research.
Update 1: added notes about using the Postgres protocol.
Update 2: mentioned CDDL.
Top GitHub Comments
As long as it’s not worse than what we currently have, I’m fine with whatever approach. I’m NOT fine with having to hunt and kill orphaned subprocesses. In my workflow I restart the server very often, so it needs to start quickly and shut down cleanly. Docker and other virtualization/containerization is a non-starter for me due to long startup time.

Also, @tailhook, please remember that we are discussing an implementation of a single RPC interface with a handful of methods, which is not public, has a single client, and no requirement to be backward compatible.