Myriad Data Generator Toolkit
=============================

*Myriad* is a development toolkit for scalable data generators. Generating large synthetic datasets with a given schema and a set of statistical properties is a challenging yet increasingly important task, especially for benchmarking and testing systems that manage web-scale data, such as [Hadoop](http://hadoop.apache.org), or parallel RDBMSs, such as [DB2](http://www-01.ibm.com/software/data/db2/). The *Myriad Toolkit* aims to simplify this process by offering a fast and easy way to develop data generators that can generate *dependent data* in parallel on a set of *independently running nodes*.

Core Features
-------------

The *Myriad Toolkit* has two main components: a generic *C++ runtime library* for scalable data generation, and a *Python specification compiler* that generates library extensions from a user-defined data generator specification written in XML.

An XML specification defines the structure of the generated *domain model* as a family of user-defined *domain types*, and the data generation logic as a corresponding family of *pseudo-random domain type generators (PRDGs)*.

Essentially, a PRDG is a function that generates a sequence of pseudo-random domain type records from an underlying sequence of pseudo-random numbers. In the XML specification, PRDGs are realized as a chain of *setter functions*. Applying a setter to a generated record assigns (i.e. sets) a specific value to one or more of its components. The *Myriad Toolkit* provides a range of primitive setters that implement various statistical properties (e.g. value distributions for a single record field or value dependencies between several record fields).
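
The idea of a PRDG as a chain of setters can be illustrated with a minimal Python sketch. This is not the Myriad API: the `Customer` type, the setters `set_age` and `set_segment`, and the helper `make_prdg` are hypothetical names chosen for illustration, and Python's `random.Random` seeded with the record position stands in for the underlying pseudo-random number sequence.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Customer:
    """Hypothetical domain type; not part of the Myriad Toolkit."""
    position: int
    age: int = 0
    segment: str = ""


# A setter assigns a value to one or more fields of a record.
Setter = Callable[[Customer, random.Random], None]


def set_age(record: Customer, rng: random.Random) -> None:
    # Value distribution for a single field: uniform over [18, 90].
    record.age = rng.randint(18, 90)


def set_segment(record: Customer, rng: random.Random) -> None:
    # Value dependency between fields: the segment weights depend on age.
    weights = [0.7, 0.2, 0.1] if record.age < 30 else [0.1, 0.5, 0.4]
    record.segment = rng.choices(
        ["student", "standard", "premium"], weights=weights
    )[0]


def make_prdg(setters: List[Setter]) -> Callable[[int], Customer]:
    """Build a generator function that applies the setters in order."""
    def generate(position: int) -> Customer:
        # Assumption: seeding with the position stands in for jumping to
        # the record's position in a shared pseudo-random number sequence.
        rng = random.Random(position)
        record = Customer(position)
        for setter in setters:
            setter(record, rng)
        return record
    return generate


generate_customer = make_prdg([set_age, set_segment])
```

Because the record at a given position depends only on that position, calling `generate_customer(42)` twice yields the same record, which is the property the parallelization scheme relies on.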

Besides the simple specification language, the *Myriad runtime library* transparently builds parallelization support into all compiled data generators. To do so, the framework makes sure that the following two conditions always hold. First, each domain record is identified by a unique position (i.e. a concrete seed) in the generating pseudo-random number sequence. Second, the sequence of pseudo-random numbers is generated by a pseudo-random number generator (PRNG) function that supports arbitrary skips to any position in the sequence in constant time. These runtime-level decisions are critical for efficient parallelization, as they allow us to (A) partition the generated PRDG sequences across an arbitrary number of data generator nodes in a shared-nothing environment, and (B) use *function shipping* (i.e. recomputing the record locally) instead of *data shipping* (i.e. transferring it over the network) to get the contents of a referenced record generated on a remote node.
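
The two conditions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PRNG or the partitioning code used by the Myriad runtime: `random_at` is a SplitMix64-style counter-based generator chosen here because any position can be computed directly, and `generate_customer`, `partition`, and `generate_on_node` are hypothetical helpers. Each record consumes a fixed number of positions, so a record ID maps to a position in constant time, and any node can recompute a record owned by another node (function shipping) without network traffic.

```python
MASK64 = (1 << 64) - 1
GAMMA = 0x9E3779B97F4A7C15  # increment used by SplitMix64
NUMBERS_PER_RECORD = 2      # assumption: every record consumes two numbers


def mix64(z: int) -> int:
    """SplitMix64 output function (64-bit finalizer)."""
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def random_at(seed: int, position: int) -> int:
    """Return the number at `position` in constant time (no iteration)."""
    return mix64((seed + (position + 1) * GAMMA) & MASK64)


def generate_customer(seed: int, record_id: int) -> dict:
    """Hypothetical PRDG: the record depends only on (seed, record_id)."""
    base = record_id * NUMBERS_PER_RECORD
    return {
        "id": record_id,
        "age": 18 + random_at(seed, base) % 73,   # 18..90
        "region": random_at(seed, base + 1) % 8,  # 0..7
    }


def partition(total: int, node: int, nodes: int) -> range:
    """Contiguous slice of record IDs owned by `node` (shared-nothing)."""
    return range(total * node // nodes, total * (node + 1) // nodes)


def generate_on_node(seed: int, total: int, node: int, nodes: int) -> list:
    """Generate only the records that belong to one node."""
    return [generate_customer(seed, i) for i in partition(total, node, nodes)]


# Function shipping: a node that needs customer 7 recomputes it locally
# instead of requesting it from the node that owns it.
referenced = generate_customer(seed=2024, record_id=7)
```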

First Steps
-----------

If you want to learn more about the *Myriad Toolkit*, please read the [Getting Started Guide](/TU-Berlin-DIMA/myriad-toolkit/wiki/GettingStarted) and the [XML Specification Reference Manual](/TU-Berlin-DIMA/myriad-toolkit/wiki/XMLSpecification-v10).

To get a running demo of a simple data generator, please check the [vldb-demo](https://github.com/TU-Berlin-DIMA/vldb-demo) package.