RFC: Voice Receive API Design/Usage
Note: DO NOT use this in production. The code is messy (and possibly broken) and probably filled with debug prints. Use only with the intent to experiment or give feedback, although almost everything in the code is subject to change.
Behold the voice receive RFC. This is where I ask for design suggestions and feedback. Unfortunately, not many people seem to have any idea of what their ideal voice receive api would look like, so it falls to me to come up with everything. Should anyone have any questions/comments/concerns/complaints/demands, please post them here. I will be posting the tentative design components here for feedback and will update them occasionally. For more detailed information on my progress, see the project on my fork. I will also be adding an example soonish.
Overview
The main concept behind my voice receive design is to mirror the voice send api as much as possible. However, due to receive being more complex than send, I’ve had to take some liberties in creating some new concepts and functionality for the more complex parts. The basic usage should be relatively familiar:
vc = await channel.connect()
vc.listen(MySink())
The voice send api calls an object that produces PCM packets a Source, whereas the receive api refers to them as a Sink. Sources have a read() function that produces PCM packets, so Sinks have a write(data) function that does something with PCM packets. Sinks can also optionally accept opus data to bypass the decoding stage if you so desire. The signature of the write(data) function is currently just a payload blob with the opus data, pcm data, and rtp packet, mostly for my own convenience during development. This is subject to change later on.
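To make that shape concrete, here is a minimal sketch of a user-defined sink under the design described above; the payload attribute name (data.pcm) is an assumption, since the exact blob layout is still in flux:
# purely illustrative sketch; the payload layout is an assumption and subject to change
class MySink:
    def write(self, data):
        # `data` is currently a blob carrying the rtp packet plus opus and pcm data
        print(f"received {len(data.pcm)} bytes of PCM")
This is the MySink passed to vc.listen() in the snippet above.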
The new VoiceClient functions are basically the same as the send variants, with listen() being the new counterpart to play().
Note: The stop() function has been changed to stop both playing and listening. I have added stop_playing() and stop_listening() for individual control.
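Put together, the control flow looks roughly like this (MySink and source are placeholders):
# rough usage sketch; MySink and source are placeholders
vc = await channel.connect()
vc.listen(MySink())      # start receiving
vc.play(source)          # playback can run alongside listening
vc.stop_playing()        # stops playback only
vc.stop_listening()      # stops receiving only
vc.stop()                # stops both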
Built-in Sinks
For simply saving voice data to a file, you can use the built-in WaveSink to write it to a wav file. The way I have this currently implemented, however, is completely broken for more than one user.
Note: Here lies my biggest problem. I currently do not have any way to combine multiple voice “streams” into one stream. The way this works is Discord sends packets for all users on the same socket, differentiated by an id (aka ssrc, from RTP spec). These packets have timestamps, but with a random start offset, per ssrc. RTP has a mechanism where the reference time is sent in a control packet, but as far as I can tell, Discord doesn’t send these control packets. As such, I have no way of properly synchronizing streams without excessive guesswork based on arrival time in the socket (unreliable at best). Until I can solve this there will be a few holes in the design, for example, how to record the whole conversation in a voice channel instead of individual users.
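One workaround that sidesteps the synchronization problem entirely is to not mix at all and instead split output per ssrc. Below is a minimal sketch using the stdlib wave module, assuming the payload exposes the ssrc and decoded pcm (data.ssrc, data.pcm) and that the PCM is discord's usual 48kHz, 16-bit, stereo format; the class itself is purely illustrative, not a built-in sink.
import wave

class PerUserWaveSink:
    # illustrative only: one wav file per ssrc, no mixing involved
    def __init__(self, prefix="user"):
        self.prefix = prefix
        self._files = {}

    def write(self, data):
        f = self._files.get(data.ssrc)
        if f is None:
            f = wave.open(f"{self.prefix}-{data.ssrc}.wav", "wb")
            f.setnchannels(2)      # 48kHz, 16-bit, stereo PCM
            f.setsampwidth(2)
            f.setframerate(48000)
            self._files[data.ssrc] = f
        f.writeframes(data.pcm)

    def cleanup(self):
        for f in self._files.values():
            f.close()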
Sinks can be composed much like Sources can (PCMVolumeTransformer+FFmpegPCMAudio, etc). I will have some built in sinks for handling various control actions, such as filtering by user or predicate.
# only listen to message.author
vc.listen(UserFilter(MySink(), message.author))
# listen for 10 seconds
vc.listen(TimedFilter(MySink(), 10))
# arbitrary predicate, could check flags, permissions, etc
vc.listen(ConditionalFilter(MySink(), lambda data: ...))
and so forth. As usual, these are subject to change when I go over this part of the design again.
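Internally, a filter like this is just a sink wrapping another sink and deciding whether to forward each packet, mirroring how PCMVolumeTransformer wraps another source. A minimal sketch, assuming the write(data) interface above (this is not the built-in implementation):
class ConditionalFilter:
    # illustrative sketch of a composable filter sink
    def __init__(self, destination, predicate):
        self.destination = destination
        self.predicate = predicate

    def write(self, data):
        # forward the packet only when the predicate approves it
        if self.predicate(data):
            self.destination.write(data)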
As mentioned before, mixing is still my largest unsolved problem. Combining all voice data in a channel into one stream is surely a common use case, and I’ll do my best to try and figure out a solution, but I can’t promise anything yet. If it turns out that my solution is too hacky, I might have to put it in some ext package on pypi (see: ext.colors).
For volume control, I recently found that libopus has a gain setting in the decoder. This is probably faster and more accurate than altering pcm packets after they’ve been decoded. Unfortunately, I haven’t quite figured out how to expose this setting yet, so I don’t have any public api to show for it.
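For reference, the setting in question is the OPUS_SET_GAIN decoder CTL, which scales decoded output in Q8 dB units (256 == 1 dB). A rough ctypes sketch of poking it directly, assuming you already hold an OpusDecoder pointer; this is not library api, just an illustration of the knob itself.
import ctypes
import ctypes.util

OPUS_SET_GAIN_REQUEST = 4034  # from opus_defines.h

_opus = ctypes.CDLL(ctypes.util.find_library('opus'))

def set_decoder_gain(decoder_ptr, db):
    # opus_decoder_ctl is variadic; the gain argument is in Q8 dB units (256 per dB)
    return _opus.opus_decoder_ctl(decoder_ptr, OPUS_SET_GAIN_REQUEST, ctypes.c_int(int(db * 256)))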
That should account for most of the public api part that I’ve designed so far. I still have a lot of miscellaneous things to do, so no ETA. Again, if you have any feedback whatsoever, please make yourself known either here or in the discord server.
Old issue content
Voice receive has been an occasionally requested feature for a very long time now, but its actual usage seems to be limited to some form of voice recognition (for commands or some other form of automation) and to recording. This is all fine and dandy, but the issue that appears is how to expose this in the library. With the v1.0 rewrite, voice send was redesigned to be more composable and give more control to the user. However, that also means the voice systems were designed with *only* send in mind. Voice send is far less complex than receive, so a considerable amount of effort has to go into designing an api that both fits well with the library and, more importantly, is useful and user friendly.
Essentially, voice receive is reading RTP packets from a socket and then decrypting and decoding them from Opus to PCM. At this point it is fairly trivial to simply expose an event that produces PCM data and a member object, but this is far from useful or user friendly. This leads us to the main question:
Those who want to use voice receive, what kind of api do you want? What kind of api would you expect from the library?
Ideally, simple or common use cases should be trivial to accomplish. More complex requirements should not be difficult or unwieldy. The library should have enough batteries included to handle most simple situations, such as saving audio to a file. What form this takes is what needs to be decided.
My part in all this
I’ve been working on voice receive for a few months now, with most of my time spent trying to wrangle technical or design related problems. Those who have been listening have heard of my struggles with RTCP packets (or rather the lack thereof), the ultimate goal being functional stream mixing. This means combining the audio streams from multiple users into one stream of data (each user has their own stream). Per the RTP spec, Discord is supposed to send RTCP packets with the timing information necessary to synchronize these streams together. Actually mixing them afterwards is trivial. The problem is that Discord does not reliably send RTCP packets. You may or may not get them depending on which voice server you end up on. Trying to bodge some optimistic yolo handling is too sketchy to be the official implementation, and is subject to the kind of problems that creep in from having questionable code. Danny would never approve of that kind of code anyways, seeing as he then has to support it.
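For context on "mixing them afterwards is trivial": once two 16-bit PCM frames are known to cover the same time window (and are the same length), combining them is just a saturating sample-wise sum, e.g. with the stdlib audioop module. The alignment itself is the part that needs the missing RTCP timing.
import audioop

def mix_frames(frame_a, frame_b):
    # saturating 16-bit sample-wise addition; assumes equal-length, already time-aligned frames
    return audioop.add(frame_a, frame_b, 2)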
Possible design
In the following section I will explain the design I came up with when working on this. If you want to reply to this RFC without being biased by reading my design, skip this section and come back after you’ve had your say.
Click to be biased
My idea was to essentially mirror the voice send api. This seemed like the logical conclusion to me. The voice send api has source objects which produce voice data. These can be composed in any way and eventually fed to the lib where it handles it from there. So why not do the reverse for voice receive? Have objects that receive voice data and do some arbitrary operations on it. These can be composed the same way source objects can be: usually nested, but advanced usages can have branching or some other form of dynamic usage. I decided to call these objects Sinks. You would use them the same way you would sources.
# using a source
voice_client.play(discord.FFmpegPCMAudio("song.mp3"))
# using a sink
voice_client.read(discord.WaveSink("voice.wav"))
The WaveSink is a simple built-in sink for writing the PCM data to a wav file. This may seem simple but it doesn’t actually work. Since there’s no way to mix audio, this can only work for one user, otherwise you end up mashing audio from multiple users together. It worked great with just one person though!
Technical issues aside, this demonstrates the main concept for simple usages. However, the more I worked on this, the more I realized that trying to mirror the voice send api as much as possible just wouldn’t work, due to how simple send is. There’s also the need for controls deciding whose data to handle (and who to ignore), how long to record for (besides explicitly stopping the whole reader), etc.
Basically, without mixing, my design is incomplete. I can’t complete it until I have a reliable way to mix audio, since I can’t design an api around something that doesn’t exist. Supposedly Discord is going to push out an update to all voice servers in the future, which hopefully means either RTCP will reliably be sent or there will be some other synchronization mechanism. I decided that if I do go through with this design, I’ll have to complete it without a way to mix audio, since there’s just no good way to do it without RTCP packets for timing data.
I have updated the OP. Anyone vaguely interested in this feature should read the new content.
The branch hasn’t been updated for a while, pending a redesign. I manually made all of the voice recv related changes over master a while back, and I’ve cherry picked these changes to be up to date with the current master at https://github.com/Gorialis/discord.py/tree/voice-recv-mk3
You can install it directly using:
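Assuming the standard pip-over-git form for the branch linked above (the exact invocation is an assumption):
# assumed standard pip-over-git install for the voice-recv-mk3 branch
python -m pip install -U git+https://github.com/Gorialis/discord.py@voice-recv-mk3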
Consistency and production warnings still apply. This branch is likely to spam your console while in use, and things may (and probably will) go wrong (and there are plenty of known bugs at the moment).
I intend to at some point work out voice ws flow myself and try to see if I can breathe life back into this project, perhaps looking at other libs that have implemented voice receive successfully for inspiration regarding sane frontends. No ETA on that yet, though.