Myriad Data Generator Toolkit
=============================

<span class="float-left"><img src="/TU-Berlin-DIMA/myriad-toolkit/wiki/img/myriad_logo.floatleft.png" alt="Myriad Toolkit logo" /></span>

*Myriad* is a development toolkit for scalable data generators. Generating large, synthetic datasets with a given schema and a set of statistical constraints is a challenging yet increasingly important task, especially for benchmarking and testing web-scale data management systems or parallel RDBMSs (e.g. [Hadoop](http://hadoop.apache.org), [DB2](http://www-01.ibm.com/software/data/db2/)).

The *Myriad Toolkit* aims to simplify this process by providing a fast and easy way to develop data generators that can generate *statistically dependent data* in parallel on a set of *independently running nodes*.

Core Features
-------------
The *Myriad Toolkit* consists of two main components:

* a generic C++ *runtime library* for scalable data generation, and
* a Python *prototype compiler* that generates library extensions from a *prototype specification* of a user-defined data generator written in XML.

Through the use of a compact [*XML specification language*](/TU-Berlin-DIMA/myriad-toolkit/wiki/XML-Specification-Reference-Manual), Myriad users can define the *domain model* to be generated as a family of user-defined *domain types*, and the associated data generation logic as a corresponding family of *pseudo-random domain type generators (PRDGs)*.

In essence, PRDGs are functions that transform a sequence of pseudo-random numbers into a sequence of pseudo-random domain type records. PRDGs are specified as chains of *setter functions*, each one responsible for the assignment of a fixed-length substream of values to one or more record fields. The *Myriad Toolkit* provides a range of built-in primitive setters that realize various statistical properties (e.g. single field value distributions or value dependencies between record fields).
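The setter-chain idea can be illustrated with a small sketch. The Python below is not Myriad's API (the runtime library is C++ and generators are specified in XML); the names `uniform_int`, `dependent_choice`, and `make_prdg`, as well as the `age`, `segment`, and `plan` fields, are hypothetical and chosen only for illustration. It assumes that each setter declares how many pseudo-random numbers it consumes, so that every record uses a substream of the same length.

```python
import random

# Hypothetical value dependency: the available plans depend on the segment.
PLANS = {0: ["free"], 1: ["free", "basic"], 2: ["basic", "pro"]}


def uniform_int(field, low, high):
    """Setter: draw `field` uniformly from [low, high] using one number."""
    def apply(record, values):
        record[field] = low + int(values[0] * (high - low + 1))
    return 1, apply  # (substream length, setter function)


def dependent_choice(field, parent, options):
    """Setter: pick `field` from a list selected by the already-set `parent`."""
    def apply(record, values):
        choices = options[record[parent]]
        record[field] = choices[int(values[0] * len(choices))]
    return 1, apply


def make_prdg(setters):
    """Chain setters into a generator over a stream of floats in [0, 1)."""
    def generate(numbers, count):
        for _ in range(count):
            record = {}
            for width, apply in setters:
                # Each setter always consumes exactly `width` numbers.
                apply(record, [next(numbers) for _ in range(width)])
            yield record
    return generate


customer_prdg = make_prdg([
    uniform_int("age", 18, 90),
    uniform_int("segment", 0, 2),
    dependent_choice("plan", "segment", PLANS),
])

rng = random.Random(7)
numbers = iter(rng.random, None)  # endless stream of floats in [0, 1)
records = list(customer_prdg(numbers, 5))
for record in records:
    print(record)
```

Because every setter in this sketch consumes a fixed number of values, record *i* always starts at position 3 × *i* of the stream, so a record can be located by its position alone.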

The *Myriad runtime library* transparently builds parallel execution support into all compiled data generators. To do so, the framework ensures that the following two conditions always hold:

1. Each domain record is identified by a unique position (i.e. a concrete seed) in the generating pseudo-random number sequence.
2. The sequence of pseudo-random numbers is generated by a pseudo-random number generator (PRNG) function that supports skips to an arbitrary position in the sequence in constant time.
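To make the second condition concrete, here is a minimal sketch of a generator with constant-time skips. The text above does not say which PRNG Myriad uses, so this is a stand-in: a counter-based generator using the SplitMix64 mixing function, where the value at a position is a pure function of the seed and that position. The class name `SkippablePRNG` and its methods are hypothetical.

```python
MASK64 = (1 << 64) - 1
GAMMA = 0x9E3779B97F4A7C15  # SplitMix64 increment (odd constant)


def _mix(z):
    """SplitMix64 output mixing function on a 64-bit integer."""
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


class SkippablePRNG:
    """Stand-in PRNG whose n-th value depends only on (seed, n)."""

    def __init__(self, seed, position=0):
        self.seed = seed & MASK64
        self.position = position

    def skip_to(self, position):
        # Constant time: no intermediate values are generated.
        self.position = position

    def next(self):
        value = _mix((self.seed + (self.position + 1) * GAMMA) & MASK64)
        self.position += 1
        return value


sequential = SkippablePRNG(42)
values = [sequential.next() for _ in range(1000)]

jumper = SkippablePRNG(42)
jumper.skip_to(700)
print(jumper.next() == values[700])
```

The skip costs the same whether the target position is 700 or 700 billion, which is the property the second condition requires.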

These runtime-level decisions are critical for efficient parallelization. More specifically, they allow for

* (A) partitioning the generated PRDG sequences across an arbitrary number of data generator nodes in a shared-nothing environment, and
* (B) the use of *function shipping* (i.e. re-computing) instead of *data shipping* (i.e. transferring over the network) to get the contents of a referenced record generated on a remote node.
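Points (A) and (B) can be sketched under the same assumptions. The code below is illustrative only and not Myriad's implementation: `random_at` is a stand-in position-addressable PRNG (a SplitMix64-style mix), and the `customer` and `order` record types, the seeds, and the sizes are hypothetical. Each node generates a contiguous range of order positions, and an order that references a customer recomputes that customer locally instead of fetching it from another node.

```python
MASK64 = (1 << 64) - 1
CUSTOMER_SEED, ORDER_SEED = 2011, 2012  # hypothetical per-type seeds
NUM_CUSTOMERS, NUM_ORDERS = 100, 1000   # hypothetical sizes


def random_at(seed, position):
    """Stand-in PRNG: value at `position`, computed in constant time."""
    z = (seed + (position + 1) * 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def customer(i):
    """Customer i uses positions 2*i and 2*i + 1 (modulo bias ignored)."""
    base = 2 * i
    return {"id": i,
            "age": 18 + random_at(CUSTOMER_SEED, base) % 73,
            "segment": random_at(CUSTOMER_SEED, base + 1) % 3}


def order(i):
    """Order i references a customer and depends on that customer's segment."""
    base = 2 * i
    customer_id = random_at(ORDER_SEED, base) % NUM_CUSTOMERS
    # Function shipping: recompute the referenced record from its position
    # rather than requesting it from the node that "owns" it.
    buyer = customer(customer_id)
    return {"id": i,
            "customer_id": customer_id,
            "quantity": 1 + random_at(ORDER_SEED, base + 1) % 5,
            "discount": 10 if buyer["segment"] == 2 else 0}


def partition(total, node, num_nodes):
    """Contiguous range [begin, end) of record positions owned by `node`."""
    return total * node // num_nodes, total * (node + 1) // num_nodes


def generate_on_node(node, num_nodes):
    begin, end = partition(NUM_ORDERS, node, num_nodes)
    return [order(i) for i in range(begin, end)]


single = generate_on_node(0, 1)
merged = [r for node in range(4) for r in generate_on_node(node, 4)]
print(merged == single)
```

The nodes share nothing but the seeds and the record counts, and the combined output is the same regardless of how many nodes take part.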

First Steps
-----------

To learn more about the *Myriad Toolkit*, please read the [Quick Start Guide](/TU-Berlin-DIMA/myriad-toolkit/wiki/Quick-Start-Guide) and the [XML Specification Reference Manual](/TU-Berlin-DIMA/myriad-toolkit/wiki/XML-Specification-Reference-Manual).

For a running demo of a simple data generator, see the [vldb-demo](https://github.com/TU-Berlin-DIMA/vldb-demo) package.

Publications
------------
The following publications describe the *Myriad Toolkit*:

* [*Myriad: Scalable and Expressive Data Generation*](https://www.stratosphere.eu/sites/default/files/papers/parallelDataGeneration_12.pdf) - Alexander Alexandrov, Kostas Tzoumas, Volker Markl; PVLDB, 5(12), 2012: pp. 1890-1893
* [*Myriad - Parallel Data Generation on Shared-Nothing Architectures*](https://www.stratosphere.eu/sites/default/files/papers/Myriad_11.pdf) - Alexander Alexandrov, Berni Schiefer, John Poelman, Stephan Ewen, Thomas Bodner, Volker Markl; Proceedings of the First Workshop on Architectures and Systems for Big Data (ASBD), 2011

Contact
-------
For questions about the Myriad Data Generator Toolkit or related topics, please use the [mailing list](https://lists.tu-berlin.de/mailman/listinfo/dima-myriad.toolkit).

* [Prof. Dr. rer. nat. Volker Markl, FG DIMA, TU Berlin](http://www.dima.tu-berlin.de/menue/mitarbeiter/volker_markl/) - *principal investigator*
* [Alexander Alexandrov, FG DIMA, TU Berlin](mailto:alexander.alexandrov@tu-berlin.de) - *lead developer*
* [Thomas Bodner, FG DIMA, TU Berlin](mailto:thomas.o.bodner@googlemail.com) - *general assistance*
* [Christoph Brücke, FG DIMA, TU Berlin](mailto:christoph.bruecke@campus.tu-berlin.de) - *general assistance*

Acknowledgements
----------------

The Myriad Toolkit is developed as part of the [Stratosphere Project](https://www.stratosphere.eu/) at the [Fachgebiet Datenbanksysteme und Informationsmanagement, TU Berlin](http://www.dima.tu-berlin.de/) under the supervision of [Prof. Dr. rer. nat. Volker Markl](http://www.dima.tu-berlin.de/menue/mitarbeiter/volker_markl/).

The project is funded by the [Deutsche Forschungsgemeinschaft](http://www.dfg.de), the [European Institute of Innovation and Technology](http://eit.europa.eu/), and the [IBM Centre for Advanced Studies, Toronto](https://www-927.ibm.com/ibm/cas/).