Myriad Data Generator Toolkit
=============================

*Myriad* is a development toolkit for scalable parallel data generators. Generating large synthetic datasets that conform to a predefined schema and a set of statistical properties is a challenging yet increasingly important task, especially for benchmarking and testing systems designed for managing and processing web-scale data, such as [Hadoop](hadoop.apache.org), or parallel RDBMSs such as [DB2](www-01.ibm.com/software/data/db2/). The *Myriad Toolkit* aims to ease this process by offering developers a fast and easy way to create data generators that can produce dependent data on independently running nodes. All data generators created with the toolkit scale out linearly on shared-nothing architectures.

Core Features
-------------

The main functional advantage of the *Myriad Toolkit* is its built-in parallelization support for the specified data generation programs. Our parallelization approach maps fixed-size chunks of an underlying pseudo-random number generator (PRNG) stream into a pseudo-random stream of records. The horizontally partitioned parallel execution model implemented by the *Myriad Toolkit* relies on efficient `skip-ahead` PRNG operations to advance each generator node to the starting position of its assigned record substream.

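
To make the `skip-ahead` idea concrete, here is a minimal sketch in Python. It is not the toolkit's PRNG: it assumes a plain 64-bit linear congruential generator, and the names `step`, `skip_ahead`, `record_start_state`, and `CHUNK` are hypothetical. The skip composes the affine update with itself by repeated squaring, so jumping `n` positions takes a logarithmic number of multiplications instead of `n` sequential steps.

```python
# Hypothetical illustration only: a 64-bit LCG with a fast skip-ahead.
# The PRNG actually used by the Myriad runtime is not shown here.

MASK = (1 << 64) - 1
A = 6364136223846793005   # LCG multiplier (Knuth's MMIX constants)
C = 1442695040888963407   # LCG increment
CHUNK = 4                 # assumed number of random values consumed per record


def step(state: int) -> int:
    """Advance the generator by exactly one position."""
    return (A * state + C) & MASK


def skip_ahead(state: int, n: int) -> int:
    """Return the state reached after n calls to step(), in O(log n) time."""
    acc_mult, acc_inc = 1, 0    # accumulated transformation (starts as identity)
    cur_mult, cur_inc = A, C    # transformation equivalent to 2^k single steps
    while n > 0:
        if n & 1:
            # Compose the accumulated transformation with the current power.
            acc_mult = (acc_mult * cur_mult) & MASK
            acc_inc = (acc_inc * cur_mult + cur_inc) & MASK
        # Square the current transformation: 2^k steps become 2^(k+1) steps.
        cur_inc = ((cur_mult + 1) * cur_inc) & MASK
        cur_mult = (cur_mult * cur_mult) & MASK
        n >>= 1
    return (acc_mult * state + acc_inc) & MASK


def record_start_state(seed: int, position: int) -> int:
    """PRNG state at the start of the chunk that belongs to a given record."""
    return skip_ahead(seed, position * CHUNK)
```

A node that is assigned the records starting at position `p` would call `record_start_state(seed, p)` once and then proceed with ordinary `step` calls, without ever producing the numbers that belong to other nodes.
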
The *Myriad Toolkit* has two main components: a generic *C++ runtime library* for scalable data generation, and a *Python specification compiler* that generates library extensions from a user-defined data generator specification written in XML.

The skip-ahead technique also makes it efficient to realize a broad set of reference-based model restrictions, because the random values of each referenced record depend entirely on its sequence number and are therefore easy to re-compute. You can generate a set of `A`-records and a referencing set of `B`-records simply by sampling arbitrary `a` records from the `A`-sequence for each `b` record, regardless of the actual partitioning and parallelization specifics. Restrictions of the form `b.y := f(a)` (used, for instance, to set proper foreign keys in `b`) can be implemented by locally re-computing the sampled `a` records from their positions in the `A`-sequence.

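
The reference mechanism can be sketched as follows. This is an illustration under stated assumptions, not the toolkit's implementation: `rand_at` is a stand-in position-addressable generator (a SplitMix64-style mix), and the record layouts, seeds, and the function `f(a) = 2 * a.x` are all hypothetical.

```python
# Hypothetical sketch: b-records reference a-records by re-computing them
# from their position instead of looking them up or receiving them remotely.

MASK = (1 << 64) - 1
SEED_A, SEED_B = 11, 23   # assumed independent seeds for the two sequences
NUM_A = 1000              # assumed cardinality of the A-sequence


def rand_at(seed: int, position: int) -> int:
    """Pseudo-random value that depends only on (seed, position)."""
    z = (seed + (position + 1) * 0x9E3779B97F4A7C15) & MASK
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
    return z ^ (z >> 31)


def make_a(position: int) -> dict:
    """Generate the a-record at a given position of the A-sequence."""
    return {"id": position, "x": rand_at(SEED_A, position) % 100}


def make_b(position: int) -> dict:
    """Generate a b-record that references a sampled a-record."""
    a_position = rand_at(SEED_B, position) % NUM_A  # sample an arbitrary a
    a = make_a(a_position)                          # local re-computation
    return {"id": position, "a_id": a["id"], "y": 2 * a["x"]}  # b.y := f(a)
```

Because `make_a` depends only on the position, any node can evaluate `make_b` for its own range of `b` positions, whichever node happens to generate the referenced `a` records.
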
An XML specification describes the structure of the generated *domain model*: a family of user-defined *domain types* together with a corresponding family of *pseudo-random domain type generators (PRDGs)*.

Essentially, a PRDG is a function that generates a sequence of pseudo-random domain type records from an underlying sequence of pseudo-random numbers. In the XML specification, PRDGs are realized as chains of *setter* functions. Applying a setter to a generated record assigns (i.e., sets) a specific value to one or more of its components. The *Myriad Toolkit* provides a range of primitive setters that can be used to construct PRDGs with different statistical properties (e.g., value distributions for specific record fields or value dependencies between several record fields).

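
As a rough illustration of a setter chain, consider the Python sketch below. It does not use the toolkit's XML syntax or its C++ classes: the `Customer` type, the three setters, and the per-position seeding with Python's `random.Random` are assumptions made for this example only.

```python
# Hypothetical sketch of a PRDG built as a chain of setters.
import random
from dataclasses import dataclass


@dataclass
class Customer:               # assumed domain type with three fields
    id: int = 0
    age: int = 0
    segment: str = ""


def set_id(record, position, rng):
    record.id = position                      # value derived from the position


def set_age(record, position, rng):
    record.age = rng.randint(18, 90)          # uniform value distribution


def set_segment(record, position, rng):
    # Value dependency between fields: segment is a function of age.
    record.segment = "senior" if record.age >= 65 else "regular"


SETTER_CHAIN = [set_id, set_age, set_segment]


def generate_customer(position: int, seed: int = 2012) -> Customer:
    """Apply every setter in order to a fresh record for this position."""
    rng = random.Random(seed * 1_000_003 + position)  # stand-in for a PRNG chunk
    record = Customer()
    for setter in SETTER_CHAIN:
        setter(record, position, rng)
    return record
```

The order of the chain matters here: `set_segment` reads the `age` field, so it has to run after `set_age`.
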
Extensible Architecture
-----------------------

The *Myriad Toolkit* facilitates the development of custom data generators through its extensible, object-oriented architecture. Creating a generator for a custom data type can be as easy as defining the *domain model*: a set of *domain types* (also called *records*) and a corresponding set of *setter chains*, i.e., functions that map a chunk of random numbers to a particular record instance of the corresponding domain type.

Besides the simple specification language, the *Myriad runtime library* also transparently provides parallelization support for all compiled data generators. The underlying sequence of pseudo-random numbers is partitioned so that each pseudo-random record is identified by a unique position (i.e., a concrete seed) in the number sequence. In addition, the sequence is produced by a pseudo-random number generator (PRNG) that supports skips to an arbitrary position in constant time. These runtime-level decisions are critical for efficient parallelization, as they allow us to (A) partition the generated PRDG sequences across an arbitrary number of data generator nodes in a shared-nothing environment, and (B) use *function shipping* (i.e., re-computing a record locally) instead of *data shipping* (i.e., sending it over the network) to get the contents of a referenced record generated on a remote node.

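
Point (A) can be sketched in a few lines of Python. The partitioning scheme, the helper names (`node_range`, `record_at`, `run_node`), and the position-addressable `rand_at` function are assumptions for this example and do not reflect the runtime library's API.

```python
# Hypothetical sketch: horizontal partitioning of a record sequence
# across shared-nothing generator nodes.

MASK = (1 << 64) - 1


def rand_at(seed: int, position: int) -> int:
    """Pseudo-random value that depends only on (seed, position)."""
    z = (seed + (position + 1) * 0x9E3779B97F4A7C15) & MASK
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
    return z ^ (z >> 31)


def node_range(total: int, num_nodes: int, node_id: int) -> tuple:
    """Half-open range [begin, end) of record positions owned by a node."""
    base, extra = divmod(total, num_nodes)
    begin = node_id * base + min(node_id, extra)
    end = begin + base + (1 if node_id < extra else 0)
    return begin, end


def record_at(seed: int, position: int) -> tuple:
    """The record at a position; any node can compute it without communication."""
    return (position, rand_at(seed, position) % 1000)


def run_node(seed: int, total: int, num_nodes: int, node_id: int) -> list:
    """Generate only the records that belong to this node's partition."""
    begin, end = node_range(total, num_nodes, node_id)
    return [record_at(seed, p) for p in range(begin, end)]
```

Concatenating the output of `run_node` for all node IDs yields the same sequence as a single-node run, and a node that needs a record owned by another node can call `record_at` itself, which is the function shipping described in point (B).
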
First Steps
-----------

If you want to learn more about the *Myriad Toolkit*, please read the [Getting Started Guide](/TU-Berlin-DIMA/myriad-toolkit/wiki/GettingStarted) and the [Quick Tour](/TU-Berlin-DIMA/myriad-toolkit/wiki/QuickTour).

To get a running demo of a simple data generator, please check the [vldb-demo](https://github.com/TU-Berlin-DIMA/vldb-demo) package.

Publications
------------

Here is a list of publications that describe the *Myriad Toolkit*:

* [*Myriad: Scalable and Expressive Data Generation*](https://www.stratosphere.eu/sites/default/files/papers/parallelDataGeneration_12.pdf) - Alexander Alexandrov, Kostas Tzoumas, Volker Markl; PVLDB, 5(12), 2012: pp. 1890-1893
* [*Myriad - Parallel Data Generation on Shared-Nothing Architectures*](https://www.stratosphere.eu/sites/default/files/papers/Myriad_11.pdf) - Alexander Alexandrov, Berni Schiefer, John Poelman, Stephan Ewen, Thomas Bodner, Volker Markl; Proceedings of the First Workshop on Architectures and Systems for Big Data (ASBD), 2011

Contact
-------

For questions about the *Myriad Data Generator Toolkit* or related topics, please use the [mailing list](https://lists.tu-berlin.de/mailman/listinfo/dima-myriad.toolkit).

* [Thomas Bodner, FG DIMA, TU Berlin](mailto:thomas.o.bodner@googlemail.com) - *general assistance*
* [Christoph Brücke, FG DIMA, TU Berlin](mailto:christoph.bruecke@campus.tu-berlin.de) - *general assistance*

Acknowledgements
----------------