Datashards

Datashards is a new storage primitive that provides a base for building highly secure applications.
Using Datashards, information is not only end-to-end encrypted but also able to be stored and transmitted by untrusted third parties without risk of surveillance or tampering.

It allows users and application developers to reason about secure, private data online or offline.


Listen to librelounge.org 26 for a 47-minute introduction

ActivityPub Conference Unconf Session
Find a read-only backup of the #datashards etherpad here:

Datashards is essentially two different things; one builds on the other.

Immutable datashards:

You’ve got a piece of data, you encrypt it with a random key (this is super simplified), take the hash of the object, that’s the content identifier. There’s a little bit more complexity to protect against certain types of attacks, specifically data shape attacks, and detail about how we do the encryption, but that’s immutable datashards.
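A minimal sketch of that "super simplified" flow, under loud assumptions: the cipher below is a toy SHA-256 keystream standing in for whatever suite the spec actually uses, the identifier is assumed to be the SHA-256 of the ciphertext, and the function names are made up for illustration.

```python
import hashlib
import secrets
from typing import Tuple


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with SHA-256(key || counter) blocks.

    Illustration only. This is NOT the Datashards cipher suite and must
    not be used to protect real data.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


def make_immutable_shard(cleartext: bytes) -> Tuple[str, bytes, bytes]:
    """Encrypt under a fresh random key; return (content_id, key, ciphertext)."""
    key = secrets.token_bytes(32)
    ciphertext = keystream_xor(key, cleartext)
    # Assumption: the identifier is the hash of the encrypted bytes, so a
    # store can check integrity without ever holding the key.
    content_id = hashlib.sha256(ciphertext).hexdigest()
    return content_id, key, ciphertext


def open_immutable_shard(content_id: str, key: bytes, ciphertext: bytes) -> bytes:
    """Check the ciphertext against its identifier, then decrypt."""
    if hashlib.sha256(ciphertext).hexdigest() != content_id:
        raise ValueError("ciphertext does not match content identifier")
    return keystream_xor(key, ciphertext)
```

Whoever holds only the ciphertext can verify it against the identifier; only someone who also holds the key can read it.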

It’s a storage primitive that can be used for anything.

I’ll give a non-AP example:
Wouldn’t it be nice if you could distribute software with verifiable downloads?
Or it could be used inside an organisation, one department cannot access another department’s data.

It also means that if a node goes down, it would be possible for other nodes to have that content. If witches.town goes away I can still see all the posts.

This is an approach, not a full piece of software. We haven’t defined a routing layer.

IPFS is a similar technology but we solve two things it does not solve. We assume the data is encrypted. Three things… IPFS assumes you’re using the internet and it has a baked-in routing layer. Its baked-in routing layer has some Filecoin and other shit… we don’t assume that, we want to separate out the routing layer from the storage layer. Maybe your stores are on USB keys and you’re sneakernetting them, or it’s an ad hoc mesh network.

Second thing is we encrypt our data, which is designed to protect ourselves. You may need to hold data on behalf of me but you don’t know what is inside the data.

Three, we also do data shape protection, that’s a really important component.

Simple metaphor: you get a stegosaurus for your birthday and your parents wrap it up and you can see whatever is wrapped has a head and spikes and a tail. You can see what it is. What we do is take the data, chunk it into 32k chunks and encrypt those chunks so it’s not possible just based on the size to know what it is.

Add padding if necessary.
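A rough sketch of the chunk-and-pad step, before any encryption is applied. The 32k size comes from the notes above; the padding scheme (a 0x80 marker followed by zero bytes) and the function names are assumptions for illustration, not taken from the spec.

```python
CHUNK_SIZE = 32 * 1024  # "32k chunks" from the session notes


def pad_and_chunk(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Split data into equal-size chunks, padding the tail.

    Padding scheme (assumed, not from the spec): append one 0x80 byte,
    then zero bytes up to the next multiple of chunk_size.
    """
    padded = data + b"\x80"
    remainder = len(padded) % chunk_size
    if remainder:
        padded += b"\x00" * (chunk_size - remainder)
    return [padded[i:i + chunk_size] for i in range(0, len(padded), chunk_size)]


def unpad(chunks) -> bytes:
    """Join chunks and strip the padding added by pad_and_chunk."""
    joined = b"".join(chunks)
    end = joined.rstrip(b"\x00")
    if not end.endswith(b"\x80"):
        raise ValueError("padding marker not found")
    return end[:-1]
```

Every chunk comes out the same length, so a store holding them learns only how many chunks there are, not the exact size of the original.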

Application to AP:

instead of putting an HTTP object as your activity… the activity can stay the same but your object would point to an immutable datashard.
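As an illustration only, an activity whose object is a datashard identifier instead of an HTTP URL might look like the following. The actor URL is a made-up example, and the URN is a placeholder in the idsc.digest.algorithm.key shape mentioned later in the session, not a real identifier.

```python
import json

# Hypothetical placeholder; a real value would carry an actual digest,
# algorithm/suite and key.
EXAMPLE_URN = "idsc.<digest>.<algorithm>.<key>"

activity = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Create",
    "actor": "https://example.org/users/alice",  # made-up actor
    # The activity itself is unchanged; only the object reference differs.
    "object": EXAMPLE_URN,
}

print(json.dumps(activity, indent=2))
```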

Q: datashards only cares about identifying data? but not about locating it?

That’s right. We need to talk about that. We (Chris & I) think that each application has some different needs. If we said this is how it works that would be a mistake. For example we looked at using a DHT by default but came to the conclusion it did not make sense in AP and a gossip network made more sense in AP than a DHT because it more closely resembled the natural network you get in AP. We haven’t started defining those protocols because we wanted to focus on building the foundation, then we can build things on top.

Q: in AP objects are encrypted. how do other instances access objects?

When you get an IDSC URN it contains 3 things. The important thing is the key of the object, what it is, the self-authenticating structure. The hash is the object, the object is the hash. We also give you the key so we send you the key in the URN. This is the key to unlock the object. We also tell you the algorithm you’re going to use and the suite that gets used, because that might change. We might have version 1 and need to update to version 2.

Q: so it encrypts it by itself?

No. The store is dumb, could be a USB key, just holds the data, doesn’t know what’s inside. We don’t want store operators to know what’s inside. The ‘client’ (packer, shipper, we don’t know what word) is responsible for taking the data and encrypting it or decrypting it. It’s responsible for taking the pieces and putting them back and giving you the cleartext.
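To make the "dumb store" idea concrete, here is a minimal in-memory sketch. The class and method names are hypothetical, and SHA-256 is assumed as the hash; the point is only that the store sees opaque bytes and never a key.

```python
import hashlib


class DumbStore:
    """Holds opaque blobs keyed by the hash of their bytes.

    The store never sees a key or any cleartext; encryption and
    decryption happen in the client.
    """

    def __init__(self):
        self._blobs = {}

    def put(self, blob: bytes) -> str:
        """Store a blob and return the digest it is filed under."""
        digest = hashlib.sha256(blob).hexdigest()
        self._blobs[digest] = blob
        return digest

    def get(self, digest: str) -> bytes:
        """Return the blob for a digest; raises KeyError if unknown."""
        return self._blobs[digest]

    def has(self, digest: str) -> bool:
        return digest in self._blobs
```

A client would encrypt first, then call `put` with the ciphertext and keep the key to itself.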

Q: i have an instance it encrypts the object…

Where should that encryption be done? That encryption should be done as close to the user as possible, not on the instance. I had a long talk with nightpool about this exact question. Nightpool thinks it should be done on the AP server, I don’t think so because what if your AP server is serving LGBTQ+ people in Egypt. I don’t want that operator to ever be able to be compelled to decrypt that data. If the AP server has that stuff then it could be. I want all that stuff… Right now we don’t have an encrypted way of transmitting the full thing, I want us to move that control over time to be as close to the user as possible - desktop, phone. Not on some server. Maybe in the beginning it might be.

Q: only the user can access objects, only the user has the key? if an instance goes down and only the user has access to their posts…

You would have access to their posts too. You would send out this URN. In your AP object you have all the metadata and the object; instead of that you’d have idsc.digest.algorithm.key
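A small sketch of splitting such an identifier into its parts, assuming the dot-separated idsc.digest.algorithm.key shape given above is literal and that none of the fields contains a dot (for example, hex or base64url values). The type and function names are invented for illustration; the published spec may encode things differently.

```python
from typing import NamedTuple


class ImmutableUrn(NamedTuple):
    digest: str     # identifies the encrypted object
    algorithm: str  # which suite/version was used
    key: str        # key needed to decrypt the object


def parse_idsc(urn: str) -> ImmutableUrn:
    """Split an identifier in the assumed 'idsc.digest.algorithm.key' shape."""
    parts = urn.split(".")
    if len(parts) != 4 or parts[0] != "idsc":
        raise ValueError("not an idsc URN in the assumed shape")
    return ImmutableUrn(digest=parts[1], algorithm=parts[2], key=parts[3])


def format_idsc(urn: ImmutableUrn) -> str:
    """Inverse of parse_idsc."""
    return ".".join(("idsc", urn.digest, urn.algorithm, urn.key))
```

Only the digest would need to be shown to a store; the key part stays with whoever was sent the full URN.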

Q: you can get the data if you have the whole thing and you can keep the data without knowing what’s inside. What’s with the server that sends around these messages, the server will always know the key?

Right now that’s right, that’s a problem that we could solve… we have to solve it in different layers. That’s an AP problem: messages are sent in cleartext to each other. If we sent encrypted messages to each other we could store the object in that and it would solve the problem.

Q: We’re just sending plaintext messages because we don’t want to care about all the problems of end-to-end encryption with key material and multiple clients; we have to trust the instance because it’s lazy. It’s a completely different question. The question I wanted to ask is: you say datashards chops up the data into packages and stores them. If you put the digest into the algorithm, will the algorithm output which of these packets we have to combine? The whole data has the digest, how are these chopped packages…

We create a manifest document that specifies where all of the different pieces are. It says the size of the file, and here is where to get all the different pieces. We’re hoping to converge with digital bazaar.

Q: the manifest…

The entry point is the manifest unless the file is smaller than two because then there’s no point in a manifest.
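A sketch of what a manifest could look like, under stated assumptions: the field names (`size`, `chunks`), the use of SHA-256, and the plain dict standing in for a store are all illustrative. Encryption and padding are left out so the manifest logic is easier to see.

```python
import hashlib
from typing import Dict, List


def build_manifest(data: bytes, chunk_size: int, store: Dict[str, bytes]) -> dict:
    """Store each chunk under its hash; return a manifest listing them in order."""
    chunk_ids: List[str] = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        chunk_id = hashlib.sha256(chunk).hexdigest()
        store[chunk_id] = chunk
        chunk_ids.append(chunk_id)
    # The manifest records the file size and where to find every piece.
    return {"size": len(data), "chunks": chunk_ids}


def reassemble(manifest: dict, store: Dict[str, bytes]) -> bytes:
    """Fetch every chunk named in the manifest, verify it, and join them."""
    parts = []
    for chunk_id in manifest["chunks"]:
        chunk = store[chunk_id]
        if hashlib.sha256(chunk).hexdigest() != chunk_id:
            raise ValueError("chunk failed verification: " + chunk_id)
        parts.append(chunk)
    return b"".join(parts)[:manifest["size"]]
```

Because each chunk is looked up by its own hash, nothing here requires the chunks to live in the same place; `store` could just as well be several stores consulted in turn.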

Q: the chunks can be stored in different locations?

Yeah

… so you tried to separate routing from the main datashards storage spec. Do you also not care about redundancy?

Right now we don’t specify any redundancy layer. I specify that we could add some special metadata to the manifest and we’ve talked about what that might look like and that might be something we can add to the published version. It’s p0, it’s not done. We think by 1 we’d have… two flaws currently. One is that we have not yet specified what happens when you have too many chunks to fit in one manifest. That’s an easy problem, we haven’t solved it yet. This is just "we haven’t done it yet", not computer-science hard. Second problem is we can easily replicate chunks and use parity bits and all kinds of things, the problem is what happens if your initial entry point is lost? If your initial manifest is lost we have no solution currently. We’re looking into what we might be able to do about that.

In essence we could rely on default redundancy through people sharing.

Chris got most inspiration from Tahoe-LAFS, I got most of mine from Freenet. Freenet does solve this problem, was insistent that in their system something like 5% data loss so insistent you should… Freenet’s problem is not necessarily this system’s problem. Their design goals are different from ours.

Q: it’s really similar to how bittorrent works

Yeah it’s similar. The original version was called mag?? because we used magnet URLs. A lot of these are not new ideas… I implemented this thing. I have a working version of the current spec. There are two implementations of it, Chris’s and mine.

Q: wasn’t there something at rwot about this as well?

We did a cool demo where Chris took a photo, drew on the photo, uploaded the image to their instance that was running Racket, we then generated a QR code, I took my client running Python, held it up to Chris’s screen to the QR code, retrieved it and reassembled it, it took half a second to do the whole exchange. Everyone was like oooooh. The big thing was that we had two completely separate implementations that function.

Q: my interest is in the advantages, but first, mutable datashards?

The problem here is the data can’t be changed; you have a hash, it cannot be modified. Chris built something on top. Mutable datashards is a little scarier. The specification, the reason I don’t have an implementation is that they threw away the implementation and are rebuilding it… it uses elliptic curve cryptography, we’re sticking the entire public key into the… the key is the URN. The capability is that one person will have the write capability, the right key, the verify key will be a signature verification and the read, which is wacky, is the verify hashed, that document points to another idsc. It says I’m the current version, I’m version 7, I point over here, version 8 points over there. That’s it. It’s really simple. Here’s where it could be really useful for AP.

We could put DIDs on top of it. You could stick your DID document right in there. The big secret - we don’t need DIDs, you could just have this as your profile. My actor identifier is my mdsc, it says go to this server. If your instance goes down no problem, point to my new instance. We don’t need DIDs, we can get the same benefits without any DIDs or any blockchains. There is a challenge, which is that it’s unclear what the distribution mechanism is for mdscs vs idscs. For IDSCs it’s quite obvious to me, for AP gossip is the correct… I’m personally leaning toward DHTs for the mutable because I’m concerned about the possibility that when users move to a new instance they won’t be able to find one another. You need discoverability.

Q: mutable datashards - because you always use some abbreviations from the datashards world… as far as I understood it… you have an identifier and incrementing counter and a key

The document has the counter. The object stays the same. The object is always the same key you don’t have to know what version it is, it’s like git.
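A toy model of that idea: a fixed identifier whose current document carries a version counter and points at an immutable datashard. Signature verification, which the real design relies on, is deliberately left out, and every name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PointerDoc:
    version: int  # the counter lives in the document
    target: str   # identifier of an immutable datashard


class MutableSlots:
    """Keeps the newest pointer document per fixed slot identifier.

    Assumption for this sketch: whoever calls update() has already
    checked the document's signature; no cryptography happens here.
    """

    def __init__(self):
        self._latest: Dict[str, PointerDoc] = {}

    def update(self, slot_id: str, doc: PointerDoc) -> bool:
        """Accept doc only if it is newer than what we hold."""
        current = self._latest.get(slot_id)
        if current is not None and doc.version <= current.version:
            return False
        self._latest[slot_id] = doc
        return True

    def resolve(self, slot_id: str) -> str:
        """Return whatever the slot currently points to."""
        return self._latest[slot_id].target
```

A reader only ever needs the slot identifier; which version is current is something the document itself says.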

Q: for the writing you use

the write capability

Q: elliptic curve private key?

your private key is your write key. You use the hash of the write key for reading that. All of those capabilities, read/verify/write, are static and don’t change. What they point to will change.

Q: how do we change where they point to?

You’ll need a server, this is trickier. You need to check … I’m a server, I need to say do I have this key, you’re going to use your verify capability. You would modify a document, then the upload process is that you’re going to upload it to the verify key but it’s signed with your private key. The downside is it’s more susceptible to attack. Let’s say your attacker didn’t want you to be able to move to a new instance, they could stop propagating your updates or pretend they don’t have that key. More possibilities of destructure… also a bit of delay because there’s going to be some time for that information to propagate. It’s not perfect but it’s better than the nothing we have now.

Q: another interesting idea would have been but it’s not deterministic is to just hash over the public key and the version number

that’s what freenet does

Q: …the nice thing about cryptographic hashes, if you just change the version number the outcome is vastly different, just use this as the locator

I think it’s batshit insane, you start downloading all kinds of versions because you don’t know which version number it is.

Q: number 3 isn’t in the network, you never get to number 4

I don’t think it’s terrible, I just think it’s a lot of work. I think it’s unnecessary for this application.

Q: the downsides of all of this, may already tie into the next session, currently the nice thing about the current state of the fediverse is it’s simple, it’s web technology. It’s broken sometimes, but what is not broken is simple. It can be used with modern technology. What is usable uses some widely available protocols and technologies in the browser and you don’t need to write extra clients to do extra encryption or peer to peer.

I think we can do a lot of this using gossip. IDSCs can be done via gossip. I’m not convinced we need to do DHTs for mdscs but we should explore it. The bigger question is going to get back to the interface shit. For AP we’re really lucky because files don’t matter, but for other applications we’ve been thinking about what it looks like to send an object, you don’t want to send a long cryptographic key, you want a petname.

I’m sick of seeing bitcoin wallets. People are like, don’t worry, I printed it out on a QR code… are you sure it’s what you think it is? I can’t read QR codes with my eyes. I think we need some interface work there.

For AP we don’t. We have JSON-LD to tell us what kind of object it is, image, note, so from our perspective that problem disappears, we just get a bag of bytes, that’s okay for this.

Q: if we now move the URI scheme of AP over to datashards then browsers know http(s) but they don’t know datashards, either retrieval has to still be done on the backend instance server, or we need a dedicated client architecture.

You need a dedicated client application but it doesn’t need to speak a non-http protocol. The first version that chris wrote was an incomplete but functional http server serving immutable datashards. All you have to do is say here’s the URN and say that’s what I’m looking for and it goes…

Q: like a proxy?

Not a proxy, it’s the only protocol, it’s all just http underneath. You’re making it harder than it has to be.

There’s no other thing, no other network, http is (or can be) the communication protocol. My current implementation has a bunch of http requests that says I’m announcing this key, retrieving this key, it’s really simple.

Q: this is the api mechanism behind the uri scheme?

It’s the api scheme for storage to talk to clients and each other.

Q: we could just implement something like this in pleroma to save the image so if a server goes down the images don’t go down. Just as a file.

I’m scared because you haven’t trolled me once…

Q: I think overall it’s a good idea, makes sense.

Chris and I need to keep building. Chris believed that you only needed GET and PUT, I said no for AP maybe, but Chris wanted to make it more general, so then you need a few more verbs. I’m documenting all that, I’m also documenting another instance, and working with Chris on the mutable part. I don’t think we need the mutable part until later. We could do all the immutable stuff without doing the mutable stuff. There are some concerns that were brought up based on freenet. Their biggest concern was … not storage, but your bandwidth is and your i/o is going to be, especially if you’re using spindles. It would potentially change how much work we expect an AP server to do. I don’t know if that’s a big deal or not. We can’t really know because we’re only starting to look at simulations. They’ll only tell you so much, you have to deploy it.

Q: I would expect in a use case like AP most of the object requests would be in the same time frame because the new post gets out and everyone wants to look at it, could probably keep the most recent 500 MB in RAM and serve from that
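The suggestion in this question could be sketched as a cache with a byte budget that drops the least recently used shards first. The class name and eviction policy are assumptions for illustration; nothing like this is specified by Datashards.

```python
from collections import OrderedDict
from typing import Optional


class RecentBytesCache:
    """Keeps recently used blobs in memory, up to max_bytes in total."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used = 0
        self._items = OrderedDict()  # oldest entry first

    def put(self, key: str, blob: bytes) -> None:
        if key in self._items:
            self.used -= len(self._items.pop(key))
        self._items[key] = blob
        self.used += len(blob)
        # Evict least recently used entries until we fit the budget.
        while self.used > self.max_bytes and self._items:
            _, evicted = self._items.popitem(last=False)
            self.used -= len(evicted)

    def get(self, key: str) -> Optional[bytes]:
        blob = self._items.get(key)
        if blob is not None:
            self._items.move_to_end(key)  # mark as recently used
        return blob
```

With a budget of `500 * 1024 * 1024` bytes this matches the "most recent 500 MB" in the question; a miss would fall through to disk or to another store.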

People might say their bandwidth has increased, but overall bandwidth should decrease. Who remembers the Slashdot effect? This is if something gets really popular everybody wants to download it so you’re punished for having popular content. We punish you for having things people want which is really crazy. We could get away from that if we all share the load.

We need to do more work, my goal is to make several library implementations so it becomes… I’ll admit probably not Erlang… Maybe you start on the server and at some point someone is going to say you should do it in the client and I’ll say I told you so… I think for now it might be fine because it doesn’t actually change the threat model from today. All we’re doing is offering the ability to get better in the future.

Q: one thing is for images, additional to the normal url we say a datashards urn.

I agree that you might need a transition period. I would like to discuss what that looks like. Chris and I spent a lot of time talking about what the transition looked like. Talked with cj from gofed. Cj’s position is you shouldn’t offer any transition, that will push people to implement.

Q: yes but only if mastodon implements it

That’s why I tried to talk to nightpool about this, who was like well I don’t know … I said you could do it on the server but you don’t get any benefits. Everyone can do it on the server and then realise five years down the road that it was a terrible idea and move it to the client. But maybe to start with we start doing it on the server. It still solves the node going down or being unavailable problem, we just don’t solve the sensitivity issue.

I want to keep this conversation going but let’s start wrapping up.

Q: how do I delete stuff?

Really good question. You can delete things from your own store. You can ask lain to delete… You can’t compel me to delete data.

Q: and when crypto is broken in 30 years…

I’m surprised no-one brought that up, we think about it a lot. We are thinking about what it would look like to request that you go from v2 to v3, how do you tell stores, but we haven’t figured out a nice solution but we want to spend more time exploring that, we know it’s a problem. If your data is really that sensitive, either wrap it up in multiple different crypto, or don’t put it on the internet. Keep it on a USB key.

Q: you cannot put the id of the object into the object any more.

Yeah that would be a change.

datashards.net


See also
“ActivityPub: past, present, future” - Keynote by Christopher Lemmer Webber

In general the Tombstone class can be used to replace the original object’s class so that the system knows about the object, but does not serve it. I guess there will be a dedicated topic on deletion some time in #activitypub, to provide thorough information about best practices.

I did write such a topic already…

Indeed
