Data Markets

How to buy what you can't see, sell what you can't show, and price what you can't measure

There is a trillion dollars' worth of data that will never be traded. The buyers exist. The sellers exist. What doesn't exist is a sane way to transact between them. I want to sketch why this problem matters and why it's becoming increasingly solvable.

Three Impossible Things Before Breakfast

Suppose you have data D and I have a task M (a model, an evaluation procedure, whatever). We'd both like to know whether your data is useful for my task—whether there exists some value f(D, M) > 0 that would justify a transaction.

The trouble is:

  1. You won't show me your data. And why would you? Once I've seen it, I've already extracted the value. You have no leverage (see Arrow's information paradox).
  2. I won't show you my model. My task reveals my strategy. If you knew what I was trying to do, you could build it yourself, or sell the insight to my competitors.
  3. Neither of us knows the price. The value of data is task-dependent. The same dataset might be worth millions to me and nothing to you. Without a way to adjudicate fit, we are both negotiating blind.

The Cryptographic Turn

The obvious first instinct is cryptography. And here's what's exciting: the tools to solve this now exist, at least in principle.

The mathematical structure of our problem is clean. We want to jointly compute some function f(D, M) where:

  • The seller holds D as a private input
  • The buyer holds M as a private input
  • Both learn only the output f(D, M) (or perhaps the buyer learns a bit more, enough to decide on purchase)

This is exactly the setup of secure multiparty computation (MPC). You have two parties, each with secrets, jointly computing a function without revealing their inputs. The theory is decades old and practice is now becoming feasible.
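To make the MPC setup concrete, here is a toy two-party inner product under additive secret sharing, using Beaver multiplication triples handed out by a dealer. This is a teaching sketch, not a hardened protocol: there is no malicious security, the "opening" of masked values is simulated in one process, and a real deployment would not let anyone play a trusted dealer.

```python
import random

P = 2**61 - 1  # prime modulus for the arithmetic secret-sharing field

def share(x):
    """Split x into two additive shares mod P."""
    r = random.randrange(P)
    return r, (x - r) % P

def beaver_triple():
    """Dealer generates a multiplication triple a*b = c and shares it."""
    a, b = random.randrange(P), random.randrange(P)
    return share(a), share(b), share(a * b % P)

def secure_mul(x_sh, y_sh):
    """Multiply two secret-shared values using one Beaver triple.
    Parties only ever exchange the masked openings d and e."""
    (a0, a1), (b0, b1), (c0, c1) = beaver_triple()
    x0, x1 = x_sh
    y0, y1 = y_sh
    # Each party masks its shares locally; the masked values are opened.
    d = (x0 - a0 + x1 - a1) % P  # d = x - a (statistically hides x)
    e = (y0 - b0 + y1 - b1) % P  # e = y - b
    z0 = (c0 + d * b0 + e * a0 + d * e) % P
    z1 = (c1 + d * b1 + e * a1) % P
    return z0, z1

def secure_inner_product(xs, ys):
    """Inner product of two secret vectors: one secure_mul per coordinate."""
    acc0, acc1 = 0, 0
    for x, y in zip(xs, ys):
        z0, z1 = secure_mul(share(x), share(y))
        acc0, acc1 = (acc0 + z0) % P, (acc1 + z1) % P
    return (acc0 + acc1) % P  # final reconstruction step

# Seller's data gradient and buyer's evaluation gradient (toy integers)
print(secure_inner_product([3, 1, 4], [2, 7, 1]))  # 3*2 + 1*7 + 4*1 = 17
```

The point is the shape of the computation: each party holds meaningless shares, the only values ever exchanged are uniformly masked, and only the final sum is reconstructed.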

But which ff should we compute? Here the recent work on data valuation provides an answer. If the buyer's task is training or fine-tuning a model, we can use influence functions—first-order approximations of how adding a data point would change model performance:

I(z_s, z_{\text{eval}}) = -\nabla_\theta \ell(z_{\text{eval}}; \hat{\theta})^\top H^{-1} \nabla_\theta \ell(z_s; \hat{\theta})

This tells us: if we added the seller's data point z_s to training, how much would the loss on the buyer's evaluation set decrease? It's a marginal utility function, computable from gradients.
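As a toy illustration of the formula, here is the influence score computed exactly for a small linear-regression model, where the Hessian is cheap to form and invert. Everything in the setup (dimensions, damping, squared-error loss) is an illustrative choice, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linear-regression stand-in for "the buyer's model" (hypothetical
# setup; a real pipeline would use a fine-tuned network's gradients).
d, n = 5, 200
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def grad(x, y_):
    """Gradient of the squared-error loss 0.5*(theta.x - y)^2 at theta_hat."""
    return (theta_hat @ x - y_) * x

# Hessian of the mean training loss, with small damping so H is invertible.
H = X.T @ X / n + 1e-3 * np.eye(d)

def influence(z_s, z_eval):
    """I(z_s, z_eval) = -grad_eval^T H^{-1} grad_s.
    Positive => adding the seller's point z_s should lower eval loss."""
    return -grad(*z_eval) @ np.linalg.solve(H, grad(*z_s))

# A seller's candidate point, scored against one of the buyer's eval points.
z_seller = (rng.normal(size=d), 1.0)
z_eval = (X[0], y[0])
print(influence(z_seller, z_eval))
```

Note the sanity check baked into the formula: a point's influence on itself is -gᵀH⁻¹g ≤ 0, since duplicating a point you already fit cannot raise its own loss to first order.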

The key insight is that the gradients can be projected into low-dimensional subspaces (via techniques like LoRA), and inner products on projected gradients can be computed homomorphically—on encrypted data, without either party seeing the other's input.
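A minimal sketch of the projection step, using a shared Gaussian random projection as a stand-in for an agreed low-dimensional basis (a LoRA adapter basis would play the same role). The dimensions and the correlation between the two gradients are invented for illustration; the point is that the k-dimensional inner product tracks the full one:

```python
import numpy as np

rng = np.random.default_rng(1)
D, k = 10_000, 512  # full gradient dimension vs. projected dimension

# Shared random projection. In practice both parties would derive this
# from a public seed; the 1/sqrt(k) scaling makes projected inner
# products unbiased estimates of the true ones (Johnson-Lindenstrauss).
R = rng.normal(size=(k, D)) / np.sqrt(k)

g_buyer = rng.normal(size=D)                   # buyer's evaluation gradient
g_seller = 0.3 * g_buyer + rng.normal(size=D)  # correlated seller gradient

exact = g_buyer @ g_seller
approx = (R @ g_buyer) @ (R @ g_seller)  # only this k-dim product would
                                         # need to run under encryption
print(exact, approx)
```

Shrinking a 10,000-dimensional gradient to 512 dimensions cuts the homomorphic workload by roughly 20x while preserving the score's sign and rough magnitude, which is all the purchase decision needs.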

A Protocol Sketch

Here's what a working marketplace might look like:

Phase 1: Discovery. Seller publishes metadata: schema, high-level statistics, broad descriptions. No raw data or proprietary models are revealed, just enough to establish “this might be relevant.”

Phase 2: Encrypted Trial. Buyer and seller agree on an evaluation metric. It could be anything from “accuracy of my model on your data” to “expected lift on my task.” Then:

  1. Buyer encrypts their evaluation gradients using homomorphic encryption (CKKS scheme for real-valued operations)
  2. Seller encrypts their data gradients using the buyer's public key
  3. An untrusted broker computes f(D, M) entirely in ciphertext
  4. Buyer decrypts to learn the score

The seller never sees the buyer's task. The buyer never sees the seller's data. The broker learns nothing.
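To show the shape of this flow without a heavyweight HE library, here is a toy run using Paillier encryption (additively homomorphic over integers) as a stand-in for CKKS, the scheme named above for real-valued gradients. In this additive-HE variant the seller's gradient stays plaintext on the seller's side, so only steps 1, 3, and 4 appear; the gradients are integer-scaled, and the hard-coded primes are far too small and structured for real use:

```python
import math, random

# --- Toy Paillier keypair (illustration only: known, insecure primes) ---
p, q = 2147483647, 2305843009213693951  # two Mersenne primes
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)  # valid because the generator is g = n + 1

def encrypt(m):
    """Buyer encrypts m in [0, n) under its public key (n)."""
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    """Only the buyer, holding lam and mu, can decrypt."""
    return (pow(c, lam, n2) - 1) // n * mu % n

def add(c1, c2):       # Enc(a) * Enc(b) = Enc(a + b)
    return c1 * c2 % n2

def scalar_mul(c, k):  # Enc(a)^k = Enc(k * a)
    return pow(c, k, n2)

# Step 1: buyer encrypts its evaluation gradient (integer-scaled).
buyer_grad = [3, -1, 4]
enc_buyer = [encrypt(b % n) for b in buyer_grad]

# Step 3: the broker combines ciphertexts with the seller's plaintext
# gradient -- it never sees buyer_grad, only ciphertexts.
seller_grad = [2, 7, 1]
acc = encrypt(0)
for c, s in zip(enc_buyer, seller_grad):
    acc = add(acc, scalar_mul(c, s))

# Step 4: buyer decrypts the score (undoing the mod-n wrap for negatives).
score = decrypt(acc)
score = score - n if score > n // 2 else score
print(score)  # 3*2 + (-1)*7 + 4*1 = 3
```

The division of knowledge matches the protocol: the broker handles only ciphertexts, the seller handles only its own gradient, and the buyer learns one number.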

Phase 3: Transaction. If the score justifies it, money changes hands. The buyer might get API access to query functions of the data, repeated encrypted evaluations, or a one-time export with legal protections.

The Hard Parts

This all sounds cleaner than it is. Some honest difficulties:

Performance. Homomorphic encryption is still slow and MPC requires coordination. For now, this works best for high-value, constrained computations, not “explore this entire dataset freely.”

Output leakage. Even if inputs are hidden, the outputs can leak information. Repeated queries like “what's my accuracy on your data?” with different models can gradually reveal the data distribution. I don't know how to solve this; rate limits and noise injection help, but my sense is that they will not be sufficient.
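For reference, the standard noise-injection mitigation looks like this: release each score with Laplace noise calibrated to a privacy budget, so the rate limit becomes a budget constraint rather than an arbitrary cap. All parameter values here are illustrative, and, as said above, this may well not be enough:

```python
import math, random

def noisy_score(true_score, sensitivity, epsilon):
    """Release a score with Laplace noise of scale sensitivity/epsilon,
    the standard differential-privacy recipe for numeric queries."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5
    # Inverse-CDF sampling of Laplace(0, scale).
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_score + noise

# Each answered query spends privacy budget, which is what makes a
# rate limit principled rather than ad hoc: when it's gone, it's gone.
budget, per_query_eps = 1.0, 0.2
while budget >= per_query_eps:
    print(noisy_score(0.83, sensitivity=0.01, epsilon=per_query_eps))
    budget -= per_query_eps
```

Tighter epsilon means noisier answers; the seller's real choice is how much cumulative epsilon any one buyer may ever consume.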

Incentive design. Sellers can game metrics and buyers can free-ride. The mechanism-design layer (how you structure prices, auctions, and reputations) is a separate hard problem.

Trust assumptions. Pure MPC makes almost no trust assumptions, which is what makes it heavy. Trusted execution environments (like Intel SGX) are lighter but require trusting the hardware vendor.


This is part of my “things I want to see exist” series. I'm not working on this directly, but I think about it often. If you are working on it, I'd like to hear from you.