17 May 2018

Price and cite data

These are my notes on the papers:

  1. Data Markets in the Cloud: An Opportunity for the Database Community. Magda Balazinska, Bill Howe, and Dan Suciu, VLDB 2011.
  2. Why Data Citation Is a Computational Problem. Peter Buneman, Susan Davidson, and James Frew, CACM, Sept. 2016.


The general idea in both papers is to attach more labels to data or, more generally, to a (query, database) pair. The labels, in this case, are prices and citations.

Strong points:

  1. New ideas are needed on how to distribute credit not only to the authors of papers but also to the creators of the data used in research. One interesting notion is the transitivity of credit in citation, as inspired by PageRank.
  2. Cloud-based data markets are emerging but still seem to be in their infancy. The valuable points concern a deep understanding of how the value of data changes during transformation, integration, and usage, and the development of pricing and citation models and rules, supporting tools, and services that facilitate a cloud-based data market as well as citation propagation.
  3. It seems a natural idea, and probably the most viable one, to set prices at the granularity of tuples (or even values within a tuple) and then extend the relational-algebra query executor so that each operator takes prices into account. On the other hand, such an approach seems trivial: for example, when a join operator joins two tuples, we sum the prices of the input tuples to get the price of the output tuple. Defining a simple rule for aggregation might be a bit trickier. One can imagine a simple rule for the two-column output of a GROUP BY: for SUM, the price of each group is the sum of the prices of the rows grouped into it, whereas for MAX, the price of each group is the price of the row holding the maximum value within that group. Then we would still have to think about limits in a query (top 10 results) and about how to price a query that returns no results.
  4. A price-model tuning advisor seems a somewhat far-fetched idea, though an interesting one. It operates at a higher level, where we want to find a good enough pricing model for a given workload and dataset. It is similar in spirit to the Database Engine Tuning Advisor (DTA) from Microsoft SQL Server. The price-model tuning advisor would not only determine which pricing model to adopt to maximize profit but also produce income estimates.
  5. For a consumer, it is crucial to know the estimated charges under a given pricing model. However, with fine-grained prices (at the level of rows or values), such estimates can be as bad as some of the query plans returned by a query optimizer.
  6. A better and more realistic idea would be based on bid and ask prices, so that the buyer and the seller of data could negotiate the price.
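
The operator-level pricing rules from point 3 can be sketched in a few lines of Python. This is only an illustration of the rules as I read them, not the mechanism proposed in the paper: `PricedRow`, `priced_join`, `priced_group_sum`, and `priced_group_max` are hypothetical names, and prices are assumed to be integers (say, cents) attached to whole rows.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PricedRow:
    values: dict  # column name -> value
    price: int    # assumed unit: cents per row


def priced_join(left, right, key):
    """Equi-join on `key`; an output row costs the sum of its two input rows."""
    index = defaultdict(list)
    for r in right:
        index[r.values[key]].append(r)
    return [
        PricedRow({**l.values, **r.values}, l.price + r.price)
        for l in left
        for r in index.get(l.values[key], [])
    ]


def _group(rows, group_col):
    """Collect rows per group value, keeping first-seen group order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.values[group_col]].append(row)
    return groups


def priced_group_sum(rows, group_col, agg_col):
    """SUM per group; the group's price is the sum of all contributing rows."""
    return [
        PricedRow({group_col: g, "sum": sum(r.values[agg_col] for r in rs)},
                  sum(r.price for r in rs))
        for g, rs in _group(rows, group_col).items()
    ]


def priced_group_max(rows, group_col, agg_col):
    """MAX per group; the group's price is the price of the row holding the max."""
    out = []
    for g, rs in _group(rows, group_col).items():
        top = max(rs, key=lambda r: r.values[agg_col])
        out.append(PricedRow({group_col: g, "max": top.values[agg_col]}, top.price))
    return out


orders = [
    PricedRow({"cust": "a", "amount": 10}, 2),
    PricedRow({"cust": "a", "amount": 30}, 5),
    PricedRow({"cust": "b", "amount": 7}, 1),
]
customers = [
    PricedRow({"cust": "a", "city": "Oslo"}, 3),
    PricedRow({"cust": "b", "city": "Rome"}, 4),
]

print([r.price for r in priced_join(orders, customers, "cust")])       # [5, 8, 5]
print([r.price for r in priced_group_sum(orders, "cust", "amount")])   # [7, 1]
print([r.price for r in priced_group_max(orders, "cust", "amount")])   # [5, 1]
```

The sketch also makes the open questions visible: a join that matches nothing returns an empty list and therefore costs nothing, and a top-10 limit would need its own rule (charge for the 10 returned rows, or for every row that had to be read).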

Weak points:

  1. The rule-based language seems to be the first thing to explore, especially for finding patterns in (data, query) pairs to which the same label (price or citation) can be attached. However, rule-based expert systems have largely been replaced by machine learning models, which are easier to build and give better results. I see the price and the citation not as crisp, deterministic labels but as approximations of what we expect, so a machine learning model that generates such metadata could be a good next step to try.
  2. The main barrier for pricing and citation systems may be that the data needed to generate prices or citations is simply missing, either in the database itself or in a metadata repository.

More notes:

  1. An argument for why we need both the query and the data to generate a citation: even if the query returns nothing, it may be worthy of citation, but what citation is associated with the empty set? We need at least context information, so we need both Q (query) and D (data/database).
  2. We will need different types of citations; for instance, Google Scholar would have to track citations not only of articles but also of data items generated by specific researchers. The h-index and other measures will have to be adjusted to cater for this broader view of idea generation and execution.
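
The point about Q and D can be made concrete with a small Python sketch. It assumes a toy setting in which a database carries its own name, version, and curator list, and the citation is built from that metadata plus the query text rather than from the result rows. `Database` and `cite` are hypothetical names for illustration, not an API from the paper.

```python
from dataclasses import dataclass


@dataclass
class Database:
    name: str
    version: str
    curators: list  # people credited for the data
    rows: list      # each row is a dict


def cite(query_text, predicate, db):
    """Run the query and build a citation from (Q, D), not from the result."""
    result = [row for row in db.rows if predicate(row)]
    citation = (
        f"{', '.join(db.curators)}. {db.name}, version {db.version}. "
        f"Query: {query_text}."
    )
    return result, citation


db = Database(
    name="DemoDB",
    version="2018.1",
    curators=["A. Curator", "B. Curator"],
    rows=[{"family": "GPCR"}],
)

# The query matches nothing, yet the citation is still well defined.
result, citation = cite("family = 'kinase'", lambda r: r["family"] == "kinase", db)
print(result)    # []
print(citation)  # A. Curator, B. Curator. DemoDB, version 2018.1. Query: family = 'kinase'.
```

A citation computed from the result alone would have nothing to work with here, which is exactly why the context (Q, D) has to be part of the input.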