Preserve user's privacy with k-anonymity
As I understand, currently, the user submits a video ID (e.g. dQw4w9WgXcQ), and gets back a single JSON object.
Here I propose to add a new endpoint, e.g. /api/anonymousGetVideoSponsorTimes, that takes the first n (e.g. 5) characters of a hash of the video ID (SHA-1 is fine) and returns a list of possible results, like in the example below.
This approach is used in Troy Hunt’s Have I Been Pwned API; see https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/ and https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/ for example.
input: { hash_prefix: <sha1sum("dQw4w9WgXcQ").substr(0,5)> }
(e.g. { hash_prefix: '3dd08' })
output:
[
  {
    videoID: 'dQw4w9WgXcQ',
    sponsorTimes: array [float],
    UUIDs: array [string] // The ID for this sponsor time, used to submit votes
  },
  {
    videoID: 'ah20943fdhj7',
    sponsorTimes: array [float],
    UUIDs: array [string] // The ID for this sponsor time, used to submit votes
  },
  // ...
]
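To make the client side of this proposal concrete, here is a minimal sketch in Python. It assumes the response has the shape shown above; the function names `hash_prefix` and `pick_own_video` are made up for illustration and are not part of any existing SponsorBlock API. The client sends only the prefix and then picks its own video out of the returned candidates locally.

```python
import hashlib
from typing import Dict, List, Optional


def hash_prefix(video_id: str, length: int = 5) -> str:
    """Return the first `length` hex characters of the SHA-1 of a video ID."""
    return hashlib.sha1(video_id.encode("utf-8")).hexdigest()[:length]


def pick_own_video(video_id: str, candidates: List[Dict]) -> Optional[Dict]:
    """Select the entry for `video_id` from the server's candidate list.

    The server never learns which candidate the client was interested in,
    because this comparison happens on the client.
    """
    for entry in candidates:
        if entry["videoID"] == video_id:
            return entry
    return None


# Hypothetical response body; the values are placeholders, not real data.
example_response = [
    {"videoID": "dQw4w9WgXcQ", "sponsorTimes": [10.0, 25.5], "UUIDs": ["uuid-a"]},
    {"videoID": "ah20943fdhj7", "sponsorTimes": [3.0, 8.0], "UUIDs": ["uuid-b"]},
]

prefix = hash_prefix("dQw4w9WgXcQ")  # this is all the server would see
match = pick_own_video("dQw4w9WgXcQ", example_response)
```

If no candidate matches, the client simply treats the video as having no known sponsor times.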
Since YouTube IDs are sparse, and furthermore SponsorBlock only has a small part of all IDs indexed, each query will return only a small number of results, if any. If the result length were to get out of hand in the future, it would be easy to increase the number of input characters required.
For performance reasons, the database should grow a new column, sha1sum. A pseudo-SQL query for such a request might look like this:
SELECT * FROM sponsorTimes WHERE sha1sum LIKE '3dd08%'
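As a rough illustration of the server side, the sketch below uses an in-memory SQLite database. The table layout (`videoID`, `startTime`, `endTime`, `UUID`) and the helper names are assumptions made for this example, not the actual SponsorBlockServer schema; only the `sha1sum` column and the `LIKE` query come from the proposal. The prefix is validated as hexadecimal before it is used, so a client cannot smuggle in `%` or `_` wildcards.

```python
import hashlib
import sqlite3


def sha1_hex(video_id: str) -> str:
    return hashlib.sha1(video_id.encode("utf-8")).hexdigest()


def create_db() -> sqlite3.Connection:
    """Create a toy database with the proposed sha1sum column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sponsorTimes ("
        "videoID TEXT, startTime REAL, endTime REAL, UUID TEXT, sha1sum TEXT)"
    )
    # Whether a LIKE 'abc%' query can use this index depends on the database
    # engine and its collation settings; benchmark before relying on it.
    conn.execute("CREATE INDEX idx_sha1sum ON sponsorTimes (sha1sum)")
    return conn


def add_segment(conn, video_id, start, end, uuid):
    """Store a segment together with the precomputed hash of its video ID."""
    conn.execute(
        "INSERT INTO sponsorTimes VALUES (?, ?, ?, ?, ?)",
        (video_id, start, end, uuid, sha1_hex(video_id)),
    )


def lookup_by_prefix(conn, prefix: str):
    """Return all rows whose sha1sum starts with the given hex prefix."""
    if not prefix or any(c not in "0123456789abcdef" for c in prefix):
        raise ValueError("prefix must be a non-empty lowercase hex string")
    return conn.execute(
        "SELECT videoID, startTime, endTime, UUID FROM sponsorTimes "
        "WHERE sha1sum LIKE ? ORDER BY videoID, startTime",
        (prefix + "%",),
    ).fetchall()


conn = create_db()
add_segment(conn, "dQw4w9WgXcQ", 10.0, 25.5, "uuid-a")
add_segment(conn, "dQw4w9WgXcQ", 90.0, 99.0, "uuid-b")
add_segment(conn, "ah20943fdhj7", 3.0, 8.0, "uuid-c")
rows = lookup_by_prefix(conn, sha1_hex("dQw4w9WgXcQ")[:5])
```

The rows would then be grouped by `videoID` to build the response list from the example output.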
Which hashing algorithm is used is not very important, as the user will only send a fraction of the hash to the server. SHA-1 has a reasonable length, I’d say. (Let’s avoid MD5, though 😉)
Issue Analytics
- Created 4 years ago
- Reactions: 19
- Comments: 38 (19 by maintainers)
Top GitHub Comments
I came upon SponsorBlock today for the first time and a browse around led me to this issue. I personally like to worry about privacy and would not want to send every single video ID to a SponsorBlockServer instance (unless maybe I was operating one within my own network). So I considered looking into this and maybe doing a PR. But first I would like to share some thoughts.
To start I would like to point at an implementation of k-anonymity for data queries that pre-dates Pwned Passwords: the “Safe Browsing” (Chrome) and “Block dangerous and deceptive content” (Firefox) systems. Information about it is a bit spread around, but some good resources are: the API Documentation (look for “Update API”), more links collected on the Mozilla Wiki, and a study by Gerbet et al.
I think this is a good implementation to keep in the back of our heads when continuing the discussion because of the many parallels. Safe Browsing is set up to facilitate a client (web browser) wanting to check a centralised repository (Google’s collection of malicious links) for something about a resource (website) without telling the repository about every resource the client uses. Now make the client a video player, the repository SponsorBlockServer, and the resource a YouTube video, and we are in the exact same situation.
Safe Browsing also has a number of points that address some arguments raised by this discussion. You do not need to study it before reading on, just do not be surprised if I reference it again (and again, and again) 😄
I will consider the threat vector as: the SponsorBlockServer will get to know every video the user watches. Leave other discussions for later. (If we do not trust the network, we may need obfuscation techniques like padding. If we do not trust the client … well that is a separate can of worms.)
Hashing
I would say hashing is absolutely required. It is the only way to ensure no actual information about the video ID is shared with SponsorBlockServer. Even part of a video ID is still information. Furthermore, are we sure YouTube video IDs are randomly distributed throughout the possible ID space? Is a - just as likely to be part of an ID as an a? I do not have a big enough dataset to run this analysis and I do not think we need to do so at all. Hashing means we eliminate that problem.
It does not really matter what hashing algorithm is chosen. As long as it upholds the default assumptions for a hash we can use it to map both value spaces to each other: the number of IDs known to SponsorBlockServer to the number of IDs that can possibly exist in YouTube.
Pwned Passwords uses SHA-1. But Troy has also explained that he is not really hashing to seriously protect the data, only to make sure “any personal info” like email addresses “in the source data is obfuscated”. People were also “cracking the passwords for fun” and succeeding at this. (Multiple discussions have happened around this; quotes taken from a February 2018 blog post.)
I am honestly not sure it matters much for prefixes in our specific case, but Safe Browsing uses SHA-256 and as I said might be closer to our flow. I also do not expect SponsorBlock clients to exist on many platforms that would not have access to SHA-256 primitives.
So let us say we hash every video identifier with a SHA-256 function. Let us also assume that storage and querying the database is trivial. (Because any debate about TEXT vs BLOB and optimisation of WHERE statements can be had when actual benchmarking can be done.)
What to hash
This is something that was only very briefly touched on in the discussion here by @phiresky:
This is a step that once again exists in the Safe Browsing API. They do URL canonicalisation. However, my recommendation would be to not do that. In my experience anything close to trying to normalise a URL leads to differences between clients. As a quick note: RFC 3986 includes steps for “Normalization”, the WHATWG URL standard includes steps for “equivalence”, and Safe Browsing has steps for “Canonicalization”. If you are unlucky, the language you are writing your client in will have access to anything from 0 to 100 implementations of any of these.
Not only that, you would also end up having to specify additional steps per platform. YouTube has lots of possible URLs pointing at the same video. Sometimes from entirely different domains (youtu.be anyone?).
Instead stick to clear identifiers supplied by the platforms. The YouTube video ID is much less ambiguous than any URL will be. It is also likely that a client already has logic to extract the video ID from a bunch of different sources. Letting them work with that is better.
If we really value compatibility with other platforms, simply generating a URI of our own suffices. Instruct clients to hash yt: followed by the YouTube video ID. That would allow simple future expansion without messing with URLs.
Hash prefix length
Setting the length of the prefix is a balancing act. The longer the prefix, the less privacy is given. This goes the full range from 0 to 256 bits (in the case of SHA-256).
If the client asks for a list of possibilities with a 0 bit long prefix it is really just asking for the entire database. This gives complete privacy to the user as the server gets no clues about what video was being accessed. But of course it is a bit annoying for all parties involved as the client would constantly request lots and lots of unnecessary data.
If the client asks for a list of possibilities with a 256 bit long prefix it is really just sending the full hash to the server. This gives almost¹ no privacy to the user as the server could match this hash 1:1 with a video ID to know exactly what was being watched. So this is no win for the user.
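The two extremes can be sketched in Python as follows. This is only an illustration under the assumptions made earlier in this comment: SHA-256 as the hash and the string yt: followed by the video ID as the input. The function names `hashed_id` and `prefix_bits` are hypothetical. The prefix is returned as a string of 0 and 1 characters so that any bit length from 0 to 256 can be expressed, not just multiples of 4.

```python
import hashlib


def hashed_id(video_id: str, namespace: str = "yt:") -> bytes:
    """SHA-256 digest of the namespaced video ID (namespace is an assumption)."""
    return hashlib.sha256((namespace + video_id).encode("utf-8")).digest()


def prefix_bits(digest: bytes, bits: int) -> str:
    """Return the first `bits` bits of `digest` as a string of '0' and '1'."""
    total = len(digest) * 8
    if not 0 <= bits <= total:
        raise ValueError("bits must be between 0 and %d" % total)
    # Render the whole digest as a zero-padded binary string, then cut it.
    as_binary = format(int.from_bytes(digest, "big"), "0{}b".format(total))
    return as_binary[:bits]


digest = hashed_id("dQw4w9WgXcQ")
everything = prefix_bits(digest, 0)     # empty prefix: matches every entry
one_byte = prefix_bits(digest, 8)       # 256 possible values
full_hash = prefix_bits(digest, 256)    # identifies the video exactly
```

A server would return every entry whose own bit string starts with the prefix it received, so the empty prefix selects the whole database and the 256 bit prefix selects at most one video.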
The number we are aiming for is somewhere in the middle. I am however not convinced by @phiresky’s calculation of the prefix length. The length there seems to be argued from the data perspective and not from the user perspective. I do not even think there is any privacy difference depending on how many results the server returns. There is no relation I can see between number of results and the threat vector of not wanting to tell the server about what video I am watching.
From the server perspective I do not think we care. There is already an endpoint that allows the client to submit the exact video ID, so obviously we are OK with the 256 bit variant. The database is not secret and in fact readily downloadable, so we are also OK with the 0 bit variant.
The only argument from the server side I can think of is to protect against a number of DoS problems. When the database grows, it may not always be OK to support continuous downloads of the entire thing. Server instances may want to scale up the minimum hash length from 0 to a more manageable number in accordance with total database growth. Maybe in such a way that response sizes stay under a certain number of bytes.
From the client perspective I think we want to send as little information as possible. This is where the user is and where we care about the privacy. But there are also a number of platform limitations that need to be considered. Like @ajayyy mentioned before about web browsers, some platforms may put limitations on storage. Other platforms may be on limited bandwidth.
So for a first implementation I was thinking: why not leave it completely up to the client what prefix length it sends and only pick a bottom limit for the server?
We have 133417 sponsorTimes in the latest DB (just downloaded from the mirror). If we aim to never have an API response include more than 1000, we need a minimum length that cuts our hash space into approximately 134 blocks. That takes just the first 8 bits of the hash (or the first 2 hexadecimal characters): 7 bits would give only 128 blocks, while 8 bits give 256.²
If the client feels this is problematic because of storage, memory, or bandwidth concerns, it can opt to send a longer prefix whilst sacrificing some of its user’s privacy.
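The arithmetic behind that minimum can be written down as a small Python sketch. The function name `min_prefix_bits` is made up for illustration, and the calculation assumes hashes spread the entries perfectly evenly over all blocks, which real data will only approximate.

```python
def min_prefix_bits(total_entries: int, max_per_response: int) -> int:
    """Smallest prefix length (in bits) that keeps the average block size
    at or below `max_per_response`, assuming a perfectly even spread."""
    if total_entries <= 0 or max_per_response <= 0:
        raise ValueError("both arguments must be positive")
    # Ceiling division: how many blocks the hash space must be cut into.
    blocks_needed = -(-total_entries // max_per_response)
    # Smallest b with 2**b >= blocks_needed, computed without floats.
    return (blocks_needed - 1).bit_length()


bits = min_prefix_bits(133417, 1000)   # 134 blocks needed -> 8 bits
hex_chars = -(-bits // 4)              # 8 bits -> 2 hexadecimal characters
average_block = 133417 / 2 ** bits     # roughly 521 entries per block
```

With 8 bits the average block holds about 521 entries, so there is some headroom below the target of 1000 before the minimum would have to move to 9 bits.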
Future of prefixes
The Safe Browsing API once again has an interesting system. The client can download a list of prefixes straight from the server. This is interesting for a number of reasons:
Now this of course assumes the client has a way to store this list of prefixes and implements logic to keep it up to date. Therefore I am not entirely sure it makes sense to make this a must from day one.
Pull request
Like I said at the start, I am considering looking into this and getting started on a pull request. In the meantime I would love to hear everyone’s input on my thoughts, whether you think I am completely missing the mark, right on it, or anywhere in between. (Shall we rate it 0 to 256 bits? 😉)
You will also find me in the SponsorBlock Discord as Zegnat if you would like to discuss any of the above with me outside of this issue.
¹: almost, because if the video ID was not already known to the list, the hash needs to be reversed.
²: this is just math. In reality we may have some uneven distributions through sheer randomness of the data. As our full dataset is still relatively small it would be simple to run the numbers and pick a good cut-off point.
https://github.com/ajayyy/SponsorBlockServer/pull/127