Postgres Indecisiveness Plugin

PIP is a plugin for Postgres (being incorporated into MayBMS) that implements a probabilistic database supporting continuous distributions. PIP implements a hybrid of the C-Tables and VG-Functions approaches to probabilistic databases, allowing it to employ a sampling optimizer to ensure that statistical metrics are computed as efficiently as possible. More details on PIP's implementation can be found in our ICDE 2010 paper

This document is structured as follows:

Installing PIP
Using PIP

Creating Probabilistic Data
Writing Probabilistic Queries - With CType Extensions
Writing Probabilistic Queries - Without CType Extensions
Defining Probability Distributions

Reference: Expectation Operators
Reference: Predefined Distributions
Reference: Distribution Components
Reference: Distribution Utility Functions

Installing PIP

PIP is included as part of the MayBMS Distribution.

http://maybms.sourceforge.net

PIP's core functionality is embedded in a standard Postgres plugin and may be used with any compatible installation of Postgres (it has been verified to work under Postgres 8.4). PIP also includes query-rewriting functionality that makes it possible to write queries that treat probabilistic data as regular data -- this allows for nearly seamless integration of probabilistic data into existing workflows. However, this rewriting functionality requires building a copy of Postgres with PIP's CType extensions. If you do not wish to use these extensions, skip directly to Installing the PIP Library.

Installing Postgres-CType

Create a copy of MayBMS with the CType extensions by running the patch script.
```
$> cd maybms
$> ./pip_plugin/scripts/patch_postgres.sh
```

Build, install, and set up the patched MayBMS as normal. For example, on a standard configuration:

$> cd postgresql-ctype
$> ./configure
$> make
$> sudo make install
$> export PGDATA=[/usr/share/postgres_db]
$> export PATH=$PATH:/usr/local/postgres/bin
$> mkdir $PGDATA
$> initdb

Installing the PIP Library

Compile PIP.
```
$> cd pip_plugin
$> make
```
Install PIP. By default, PIP Installs the PIP library into /usr/local/lib/pgsql. To change the default install directory, modify the INSTALL_DIR variable in the Makefile.
```
$> sudo make_install
```
Configure Postgres to use PIP. This step needs to be run once for each database that you intend to use PIP with. The makefile generates two scripts to automate this process: install.sql and install.ctype.sql, for use with standard postgres installations and installs with the CType patch described above, respectively.
```
$> psql [my_database] -f install[.ctype].sql
```
To add PIP functionality to all databases created in the future, run the setup script (as above) on the database 'template_1'. Note that this does not affect any existing databases.
```
$> psql template_1 -f install[.ctype].sql
```

Using PIP

Creating Probabilistic Data

All probabilistic data in PIP is expressed using the pip_eqn datatype. New probabilistic data is introduced through the CREATE_VARIABLE function. This function takes 2 parameters: the name of a distribution, and a vector of parameters to that distribution.

For example:

CREATE TABLE my_probabilistic_data(
  name varchar(100), 
  data pip_eqn
);
INSERT INTO my_probabilistic_data
  SELECT input.name, 
         CREATE_VARIABLE("Normal", vector(input.mean, input.stddev))
  FROM   input;

This example creates a random variable by iterating over each row of the table 'input'. The Normal distribution requires two parameters: a mean and a standard deviation, both of which are drawn from the corresponding row of the input table.

PIP comes with several probability distributions predefined, described below. The name of the distribution is passed as the first parameter to the CREATE_VARIABLE function.

Writing Probabilistic Queries - With CType Extensions

Queries over probabilistic data fall into two stages. With the CType extensions, the first stage is nearly identical to querying normal data -- write your queries as you would for deterministic data. For example:

CREATE TABLE source_1(int id, measurement double precision, data pip_eqn);
CREATE TABLE source_2(int id, data pip_eqn);
CREATE TABLE source_3(int id, data pip_eqn);
-- fill source_1, 2, and 3
CREATE TABLE results AS
  SELECT source_1.id, source_1.measurement, source_3.data
  FROM   source_1, source_2, source_3
  WHERE  source_1.id = source_2.id
    AND  source_2.id = source_3.id
    AND  source_1.data > source_2.data + source_3.data;

However, the result of this query will not be a numerical result, but rather a compressed representation of the probabilistic formula that defines each cell of the result. PIP provides several operators to transform the result into a comprehendable form. These are defined below.

Writing Probabilistic Queries - Without CType Extensions

Without using the CType extensions, users must be aware of some of PIP's inner workings. Specifically, constraints over probabilistic data (e.g., source_1.data > source_2.data + source_3.data) can not be reduced to booleans (as the result of the inequality is itself probabilistic), and must be included in the query result as data. The CType extensions rewrite queries so that the the query results are modified automatically, but without them you must write your queries accordingly. These comparisons are of type pip_atom.

What this means for you: Without the CType extensions, queries must be written with comparisons over probabilistic data in the SELECT clause and not the WHERE clause. For example, the exampe query above must be rewritten as:

CREATE TABLE source_1(int id, measurement double precision, data pip_eqn);
CREATE TABLE source_2(int id, data pip_eqn);
CREATE TABLE source_3(int id, data pip_eqn);
-- fill source_1, 2, and 3
CREATE TABLE results AS
  SELECT source_1.id, source_1.measurement, source_3.data,
         source_1.data > source_2.data + source_3.data
  FROM   source_1, source_2, source_3
  WHERE  source_1.id = source_2.id
    AND  source_2.id = source_3.id;

Defining Probability Distributions

Users define new probability distributions by declaring a function (in C) for generating samples from the distribution, as well as several metadata functions that the PIP sampling optimizer can use to improve sampling efficiency. New distributions can be introduced as follows:

Create a new .c file in pip_plugin/src/dist, which should #include "#pip.h".

Add the following code to the file:

DECLARE_PIP_DISTRIBUTION([shortname]) = {
  .name = [stringname],
  .size = [paramsize],
  .init = [init_fn],
  .gen =  [gen_fn],
  .pdf =  [pdf_fn],
  .cdf =  [cdf_fn],
  .icdf=  [icdf_fn],
  .out =  [output_fn],
  .in  =  [input_fn],
  .joint= false
}

See below for definitions of each bracketed term.

Define the necessary functions as described below.
Compile and install PIP as normal. You may be prompted to upgrade your PIP install.

Reference: Expectation Operators

expectation(var, row)

Compute the expectation of the specified variable. For technical reasons, you must also provide a reference to the table that the variable appears in.

SELECT results.id, expectation(results.data, results)
FROM   results;

As a shorthand for the expectation function, you may use double angle brackets.

SELECT results.id, << results.data @ results >>
FROM   results;

conf_one(row)

Compute the confidence of a row, or probability that the row is present in the database; Such nondeterminism can arise as the result of comparisons involving probabilistic data (e.g. source_1.data > source_2.data + source_3.data).

SELECT results.id, conf_one(results)
FROM   results;

expectation_sum(var, row)

Compute the expectation of the sum aggregate over the specified probabilistic data column. For technical reasons, you must also provide a reference to the table that the column appears in.

SELECT expectation_sum(results.data, results)
FROM   results;

sum(value, row)

Compute the expectation of the sum aggregate over the specified non-probabilistic data column. This function should replace use of the standard sum aggregate. For technical reasons, you must also provide a reference to the table that the column appears in.

SELECT sum(results.measurement, results)
FROM   results;

expectation_max(var, row)

Compute the expectation of the max aggregate over the specified column. For technical reasons, you must also provide a reference to the table that the column appears in.

SELECT expectation_max(results.data, results)
FROM   results;

max(value, row)

Compute the expectation of the max aggregate over the specified non-probabilistic data column. This function should replace use of the standard max aggregate. For technical reasons, you must also provide a reference to the table that the column appears in.

SELECT max(results.measurement, results)
FROM   results;

Reference: Predefined Distributions

Zero

A distribution that is always zero (i.e., the Dirac Delta as a distribution).

CREATE_VARIABLE("Zero", vector())

Exponential

The Exponential Distribution. The Exponential Distribution takes one parameter, lambda: the inverse of the expectation of the variable being created.

CREATE_VARIABLE("Exponential", vector(lambda))

Normal

The Normal Distribution. The Normal Distribution takes two parameters, the mean, and the standard deviation of the variable being created.

CREATE_VARIABLE("Normal", vector(mean, stddev))

Poisson

The Poisson Distribution. The Poisson distribution takes one parameter, lambda: the expectation of the variable being created.

CREATE_VARIABLE("Poisson", vector(lambda))

Uniform

The Uniform Distribution. The uniform distribution takes two parameters, the endpoints of the distribution; The endpoints may be provided in any order.

CREATE_VARIABLE("Uniform", vector(low, high))

Reference: Distribution Components

The components of a PIP distribution are as follows:

shortname

A short, unique, one-word name for the distribution. The short name must be a valid c identifier

stringname

A C string containing a human-readable name for the distribution. This is the same name that will be used when an end-user calls CREATE_VARIABLE.

paramsize

The number of bytes required to store this distribution's parameters. For example, if your distribution is parametereized by two floats, this parameter should be 2 * sizeof(float).

init_fn

A pointer to a constructor function for your distribution with the following schema:

void init_fn(pip_var *var, HeapTupleHeader params)

When your initializer is invoked, var will contain a pointer to initialized pip variable. var->group_state will contain a pointer to an allocated, but uninitialized block of [paramsize] bytes. params is a pointer to the postgres vector() of parameters. See below for information on utility functions for parsing the parameter vector.

gen_fn

A pointer to a generator function for your distribution with the following schema:

float8 gen_fn(pip_var *var, int64 seed)

The generator function returns a value sampled from the distribution being defined, with parameters stored in var->group_state (as defined above). This function MUST be deterministic; Randomness is obtained from the randomly selected seed parameter. See below for information on utility functions for generating random numbers from this seed value.

pdf_fn

NULL, or a pointer to a function for computing the probability density function of the distribution being defined, with the following schema:

float8 pdf_fn(pip_var *var, float8 point)

The pdf function returns the probability density at the indicated point of the distribution being defined, with parameters stored in var->group_state (as defined above). If this pointer is NULL, optimizations involving the distribution's PDF will be ignored by PIP's sampling optimizer.

cdf_fn

NULL, or a pointer to a function for computing the cumulative distribution function of the distribution being defined, with the following schema:

float8 cdf_fn(pip_var *var, float8 point)

The cdf function returns the probability of selecting a sample less than or equal to the indicated point, given the parameters stored in var->group_state (as defined above). If this pointer is NULL, optimizations involving the distribution's CDF will be ignored by PIP's sampling optimizer.

icdf_fn

NULL, or a pointer to a function for computing the inverse of the cumulative distribution function of the distribution being defined, with the following schema:

float8 icdf_fn(pip_var *var, float8 probability)

The inverse cdf function returns a point in the domain of the distribution such that the likelihood of sampling a value less than or equal to the point is equal to the indicated probability (which is guaranteed to be between 0 and 1, inclusive), given the parameters stored in var->group_state (as defined above). If both [cdf_fn] and [icdf_fn] are provided, then it MUST be true that

x = cdf_fn(var, icdf_fn(var, x));

If this pointer is NULL, optimizations involving the distribution's inverse CDF will be ignored by PIP's sampling optimizer.

output_fn

NULL, or a pointer to a function for generating a human-readable representation of the distribution's parameters, with the following schema:

int input_fn(pip_var *var, int len, char *str)

When invoked, str will contain a pointer to a (large) allocated, but uninitialized block of memory of size len this function should fill with a C string containing a human-readable representation of the distribution's parameters (e.g., using snprintf), which should be stored in var->group_state (as defined above). The function should return the length of the string, not counting the trailing null character (as snprintf). Note: Although this function is optional, be aware that not including it will prevent users from being able to export data defined in terms of this distribution in the more commonly compatible text format, potentially preventing you from being able to upgrade your PIP install. Consequently, this function MUST be included in any production system.

input_fn

NULL, or a pointer to a function for parsing a human-readable representation of the distribution's parameters, with the following schema:

int input_fn(pip_var *var, char *str)

When the input function is invoked, var will contain a pointer to initialized pip variable. var->group_state will contain a pointer to an allocated, but uninitialized block of [paramsize] bytes. str references a C string containing a human-readable representation of this distribution's parameters, as used in [output_fn]. The input function should parse this string (e.g., using sscanf) and initialize var->group_state in the same way as [init_fn]. Note: Although this function is optional, be aware that not including it will prevent users from being able to import data defined in terms of this distribution.

Reference: Distribution Utility Functions

PIP includes a library of utility functions for use in defining distributions. These methods are defined in pip_plugin/src/include/dist.h (which will be included as part of "pip.h")

float8 dist_param_float8(HeapTupleHeader params, int n, float8 default): Utility function for parsing a Postgres vector (e.g., in [init_fn]). Returns the nth element (starting with 0) of params, cast and/or translated into a float8. If the params has fewer than n+1 elements, the default value will be returned instead.
int64 pip_prng_step(int64 seed): The central operation for generating random numbers, for use in a distribution's generator function. Takes a seed value and returns the next (random but deterministic) seed value.
int64 pip_prng_int(int64 *seed): The canonical way to generate a random integer in a distribution's generator function. Takes a seed value, and returns a random (but deterministic based on seed) integer value between 0 and 2^63-1. The seed value is passed by reference, and is stepped to its next value automatically.
float8 pip_prng_float(int64 *seed): The canonical way to generate a random float in a distribution's generator function. Takes a seed value, and returns a random (but deterministic based on seed) float value between 0 and 1. The seed value is passed by reference, and is stepped to its next value automatically.
void pip_box_muller(float8 *X, float8 *Y, int64 *seed): Generate two random (but deterministic based on seed), independent, normally distributed floating point numbers (with mean of 0 and standard deviation of 1) using the Box-Muller method. The independent variables will be stored in X and Y -- either or both may be used. The seed value is passed by reference, and is stepped to its next value automatically (note that the Box-Muller method requires two random numbers as input and as a consequence, seed is actually stepped twice).