PIP is a plugin for Postgres (being incorporated into MayBMS) that implements a probabilistic database supporting continuous distributions. PIP implements a hybrid of the C-Tables and VG-Functions approaches to probabilistic databases, allowing it to employ a sampling optimizer to ensure that statistical metrics are computed as efficiently as possible. More details on PIP's implementation can be found in our ICDE 2010 paper
This document is structured as follows:
PIP is included as part of the MayBMS Distribution.
http://maybms.sourceforge.net
PIP's core functionality is embedded in a standard Postgres plugin and may be used with any compatible installation of Postgres (it has been verified to work under Postgres 8.4). PIP also includes query-rewriting functionality that makes it possible to write queries that treat probabilistic data as regular data -- this allows for nearly seamless integration of probabilistic data into existing workflows. However, this rewriting functionality requires building a copy of Postgres with PIP's CType extensions. If you do not wish to use these extensions, skip directly to Installing the PIP Library.
$> cd maybms $> ./pip_plugin/scripts/patch_postgres.sh
$> cd postgresql-ctype $> ./configure $> make $> sudo make install $> export PGDATA=[/usr/share/postgres_db] $> export PATH=$PATH:/usr/local/postgres/bin $> mkdir $PGDATA $> initdb
$> cd pip_plugin $> make
$> sudo make_install
$> psql [my_database] -f install[.ctype].sql
$> psql template_1 -f install[.ctype].sql
All probabilistic data in PIP is expressed using the pip_eqn datatype. New probabilistic data is introduced through the CREATE_VARIABLE function. This function takes 2 parameters: the name of a distribution, and a vector of parameters to that distribution.
For example:
CREATE TABLE my_probabilistic_data(
name varchar(100),
data pip_eqn
);
INSERT INTO my_probabilistic_data
SELECT input.name,
CREATE_VARIABLE("Normal", vector(input.mean, input.stddev))
FROM input;
This example creates a random variable by iterating over each row of the table 'input'. The Normal distribution requires two parameters: a mean and a standard deviation, both of which are drawn from the corresponding row of the input table.PIP comes with several probability distributions predefined, described below. The name of the distribution is passed as the first parameter to the CREATE_VARIABLE function.
Queries over probabilistic data fall into two stages. With the CType extensions, the first stage is nearly identical to querying normal data -- write your queries as you would for deterministic data. For example:
CREATE TABLE source_1(int id, measurement double precision, data pip_eqn);
CREATE TABLE source_2(int id, data pip_eqn);
CREATE TABLE source_3(int id, data pip_eqn);
-- fill source_1, 2, and 3
CREATE TABLE results AS
SELECT source_1.id, source_1.measurement, source_3.data
FROM source_1, source_2, source_3
WHERE source_1.id = source_2.id
AND source_2.id = source_3.id
AND source_1.data > source_2.data + source_3.data;
However, the result of this query will not be a numerical result, but rather a compressed representation of the probabilistic formula that defines each cell of the result. PIP provides several operators to transform the result into a comprehendable form. These are defined below.
Without using the CType extensions, users must be aware of some of PIP's inner workings. Specifically, constraints over probabilistic data (e.g., source_1.data > source_2.data + source_3.data) can not be reduced to booleans (as the result of the inequality is itself probabilistic), and must be included in the query result as data. The CType extensions rewrite queries so that the the query results are modified automatically, but without them you must write your queries accordingly.
These comparisons are of type pip_atom.
What this means for you: Without the CType extensions, queries must be written with comparisons over probabilistic data in the SELECT clause and not the WHERE clause. For example, the exampe query above must be rewritten as:
CREATE TABLE source_1(int id, measurement double precision, data pip_eqn);
CREATE TABLE source_2(int id, data pip_eqn);
CREATE TABLE source_3(int id, data pip_eqn);
-- fill source_1, 2, and 3
CREATE TABLE results AS
SELECT source_1.id, source_1.measurement, source_3.data,
source_1.data > source_2.data + source_3.data
FROM source_1, source_2, source_3
WHERE source_1.id = source_2.id
AND source_2.id = source_3.id;
Users define new probability distributions by declaring a function (in C) for generating samples from the distribution, as well as several metadata functions that the PIP sampling optimizer can use to improve sampling efficiency. New distributions can be introduced as follows:
.c file in pip_plugin/src/dist, which should #include "#pip.h".
DECLARE_PIP_DISTRIBUTION([shortname]) = {
.name = [stringname],
.size = [paramsize],
.init = [init_fn],
.gen = [gen_fn],
.pdf = [pdf_fn],
.cdf = [cdf_fn],
.icdf= [icdf_fn],
.out = [output_fn],
.in = [input_fn],
.joint= false
}
See below for definitions of each bracketed term.expectation(var, row)SELECT results.id, expectation(results.data, results) FROM results;As a shorthand for the expectation function, you may use double angle brackets.
SELECT results.id, << results.data @ results >> FROM results;
conf_one(row)source_1.data > source_2.data + source_3.data).SELECT results.id, conf_one(results) FROM results;
expectation_sum(var, row)SELECT expectation_sum(results.data, results) FROM results;
sum(value, row)SELECT sum(results.measurement, results) FROM results;
expectation_max(var, row)SELECT expectation_max(results.data, results) FROM results;
max(value, row)SELECT max(results.measurement, results) FROM results;
CREATE_VARIABLE("Zero", vector())CREATE_VARIABLE("Exponential", vector(lambda))CREATE_VARIABLE("Normal", vector(mean, stddev))CREATE_VARIABLE("Poisson", vector(lambda))CREATE_VARIABLE("Uniform", vector(low, high))The components of a PIP distribution are as follows:
shortnamestringnameCREATE_VARIABLE.paramsize2 * sizeof(float).init_fnvoid init_fn(pip_var *var, HeapTupleHeader params)When your initializer is invoked,
var will contain a pointer to initialized pip variable. var->group_state will contain a pointer to an allocated, but uninitialized block of [paramsize] bytes. params is a pointer to the postgres vector() of parameters. See below for information on utility functions for parsing the parameter vector.gen_fnfloat8 gen_fn(pip_var *var, int64 seed)The generator function returns a value sampled from the distribution being defined, with parameters stored in
var->group_state (as defined above). This function MUST be deterministic; Randomness is obtained from the randomly selected seed parameter. See below for information on utility functions for generating random numbers from this seed value.pdf_fnfloat8 pdf_fn(pip_var *var, float8 point)The pdf function returns the probability density at the indicated
point of the distribution being defined, with parameters stored in var->group_state (as defined above). If this pointer is NULL, optimizations involving the distribution's PDF will be ignored by PIP's sampling optimizer.cdf_fnfloat8 cdf_fn(pip_var *var, float8 point)The cdf function returns the probability of selecting a sample less than or equal to the indicated
point, given the parameters stored in var->group_state (as defined above). If this pointer is NULL, optimizations involving the distribution's CDF will be ignored by PIP's sampling optimizer.icdf_fnfloat8 icdf_fn(pip_var *var, float8 probability)The inverse cdf function returns a point in the domain of the distribution such that the likelihood of sampling a value less than or equal to the point is equal to the indicated
probability (which is guaranteed to be between 0 and 1, inclusive), given the parameters stored in var->group_state (as defined above). If both [cdf_fn] and [icdf_fn] are provided, then it MUST be true that x = cdf_fn(var, icdf_fn(var, x));If this pointer is NULL, optimizations involving the distribution's inverse CDF will be ignored by PIP's sampling optimizer.
output_fnint input_fn(pip_var *var, int len, char *str)When invoked,
str will contain a pointer to a (large) allocated, but uninitialized block of memory of size len this function should fill with a C string containing a human-readable representation of the distribution's parameters (e.g., using snprintf), which should be stored in var->group_state (as defined above). The function should return the length of the string, not counting the trailing null character (as snprintf). Note: Although this function is optional, be aware that not including it will prevent users from being able to export data defined in terms of this distribution in the more commonly compatible text format, potentially preventing you from being able to upgrade your PIP install. Consequently, this function MUST be included in any production system.
input_fnint input_fn(pip_var *var, char *str)When the input function is invoked,
var will contain a pointer to initialized pip variable. var->group_state will contain a pointer to an allocated, but uninitialized block of [paramsize] bytes. str references a C string containing a human-readable representation of this distribution's parameters, as used in [output_fn]. The input function should parse this string (e.g., using sscanf) and initialize var->group_state in the same way as [init_fn]. Note: Although this function is optional, be aware that not including it will prevent users from being able to import data defined in terms of this distribution.
PIP includes a library of utility functions for use in defining distributions. These methods are defined in pip_plugin/src/include/dist.h (which will be included as part of "pip.h")
float8 dist_param_float8(HeapTupleHeader params, int n, float8 default)nth element (starting with 0) of params, cast and/or translated into a float8. If the params has fewer than n+1 elements, the default value will be returned instead.int64 pip_prng_step(int64 seed)seed value and returns the next (random but deterministic) seed value.int64 pip_prng_int(int64 *seed)seed) integer value between 0 and 2^63-1. The seed value is passed by reference, and is stepped to its next value automatically.
float8 pip_prng_float(int64 *seed)seed) float value between 0 and 1. The seed value is passed by reference, and is stepped to its next value automatically.
void pip_box_muller(float8 *X, float8 *Y, int64 *seed)seed), independent, normally distributed floating point numbers (with mean of 0 and standard deviation of 1) using the Box-Muller method. The independent variables will be stored in X and Y -- either or both may be used. The seed value is passed by reference, and is stepped to its next value automatically (note that the Box-Muller method requires two random numbers as input and as a consequence, seed is actually stepped twice).