Myriad Data Generator Toolkit
=============================
*Myriad* is a development toolkit for scalable parallel data generators. Generating large sets of synthetic data according to a predefined schema and a set of statistical restrictions is a challenging yet increasingly important task, especially for benchmarking and testing systems that manage and process web-scale data, such as [Hadoop](hadoop.apache.org), or parallel RDBMSs such as [DB2](www-01.ibm.com/software/data/db2/). Myriad aims to ease this process by providing you with a fast and easy way to create your own data generators. All generators created with the toolkit are scale-out ready and support parallelization on shared-nothing architectures.

*Myriad* is a development toolkit for scalable parallel data generators. Generating large sets of synthetic data according to a predefined schema and a set of statistical restrictions is a challenging yet increasingly important task, especially for benchmarking and testing systems that manage and process web-scale data, such as [Hadoop](hadoop.apache.org), or parallel RDBMSs such as [DB2](www-01.ibm.com/software/data/db2/). The *Myriad Toolkit* aims to ease this process by offering developers a fast and easy way to create their own data generators. All data generators created with the toolkit scale out linearly on shared-nothing architectures.

Core Features
-------------
The main functional advantage of using the toolkit is the built-in parallelization support of the generators it produces. Our parallelization approach builds on the idea of mapping fixed-size chunks from an underlying pseudo-random number generator (PRNG) into a pseudo-random stream of records. The horizontally partitioned parallel execution model implemented by the toolkit relies on efficient `skip-ahead` PRNG operations to advance each generator node to the starting position of its assigned record substream.

Moreover, the same technique facilitates the efficient realization of a broad set of reference-based model restrictions, as the random values of each referenced record depend entirely on its sequence number (and are thus easily re-computable from it). You can generate a set of `A`-records and a referencing set of `B`-records simply by sampling arbitrary `a` values from the `A`-sequence for each `b`, regardless of the current partitioning specifics. A restriction of the form `b.y := a.x` (used, for instance, to set a foreign key in `b`) can be implemented through local re-computation of the value of interest `a.x` based on the position of the `A`-sample.

Moreover, the same technique facilitates the efficient realization of a broad set of reference-based model restrictions, as the random values of each referenced record depend entirely on its sequence number (and are thus easily re-computable from it). You can generate a set of `A`-records and a referencing set of `B`-records simply by sampling arbitrary `a` records from the `A`-sequence for each `b` record, regardless of the actual partitioning and parallelization specifics. Restrictions of the form `b.y := f(a)` (used, for instance, to set proper foreign keys in `b`) can be implemented through local re-computation of the sampled `a` records based on their position in the `A`-sequence.
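The two ideas above can be illustrated with a minimal Python sketch. It is not the toolkit's actual API: the SplitMix64 generator is swapped in only because its state advances by a constant, which makes skip-ahead a single multiplication, and the names `CHUNK`, `make_a`, `make_b`, `partition`, and `generate_a` are hypothetical. Each record consumes a fixed-size chunk of the random stream, each node jumps directly to its own substream, and a `B`-record recomputes the referenced `A`-record locally from its position.

```python
MASK64 = (1 << 64) - 1
GAMMA = 0x9E3779B97F4A7C15  # SplitMix64 state increment

def splitmix64_at(seed, position):
    """Output number `position` (0-based) of a SplitMix64 stream, in O(1).

    The state after k steps is seed + k * GAMMA, so skipping ahead needs
    no iteration over the intermediate values.
    """
    z = (seed + (position + 1) * GAMMA) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

CHUNK = 2  # random numbers consumed per record (assumed for this sketch)

def make_a(seed_a, i):
    """Build the i-th A-record purely from its sequence number."""
    base = i * CHUNK
    return {
        "id": i,
        "x": splitmix64_at(seed_a, base) % 1000,
        "z": splitmix64_at(seed_a, base + 1) % 7,
    }

def make_b(seed_b, seed_a, j, num_a):
    """Build the j-th B-record; b.y := a.x for a sampled A-record."""
    a_pos = splitmix64_at(seed_b, j * CHUNK) % num_a  # sample a position in A
    a = make_a(seed_a, a_pos)  # local re-computation, no lookup or network
    return {"id": j, "a_id": a_pos, "y": a["x"]}

def partition(total, nodes, node):
    """Half-open range of record indices assigned to `node`."""
    return total * node // nodes, total * (node + 1) // nodes

def generate_a(seed_a, total, nodes, node):
    """Generate only the A-records of one node's horizontal partition."""
    begin, end = partition(total, nodes, node)
    return [make_a(seed_a, i) for i in range(begin, end)]

# Four nodes together produce exactly what a single node would produce.
serial = generate_a(42, 10, 1, 0)
parallel = [r for node in range(4) for r in generate_a(42, 10, 4, node)]
assert serial == parallel
```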

Extensible Architecture
-----------------------

The Myriad toolkit facilitates the development of custom data generators through its extensible object-oriented architecture. Creating a generator for a custom data type can be as easy as defining the type as a _bean object_-style record class and pinning down the random generation logic for a single record instance through a set of reusable _record hydration_ components.

The *Myriad Toolkit* facilitates the development of custom data generators through its extensible, object-oriented architecture. Creating a generator for a custom data type can be as easy as defining the *domain model*: a set of *domain types* (also called *records*) and a corresponding set of *setter chains*, functions that map a chunk of random numbers to a particular record instance of the corresponding domain type.
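As a rough illustration of the setter-chain idea, the following Python sketch maps a chunk of random numbers onto one record instance. It only mirrors the concept described above and is not the toolkit's real interface; the `Customer` type, the setter functions, and `apply_chain` are all hypothetical names, and the random numbers are assumed to be floats in `[0, 1)`.

```python
from dataclasses import dataclass

@dataclass
class Customer:
    """Hypothetical domain type (record)."""
    id: int = 0
    age: int = 0
    segment: str = ""

SEGMENTS = ("consumer", "corporate", "public")

def set_age(record, u):
    # Map u in [0, 1) to an age between 18 and 77.
    record.age = 18 + int(u * 60)

def set_segment(record, u):
    # Map u in [0, 1) to one of the segments with equal probability.
    record.segment = SEGMENTS[int(u * len(SEGMENTS))]

# The setter chain: one setter per random number in the record's chunk.
SETTER_CHAIN = (set_age, set_segment)

def apply_chain(position, chunk):
    """Turn a chunk of random numbers into the record at `position`."""
    record = Customer(id=position)
    for setter, u in zip(SETTER_CHAIN, chunk):
        setter(record, u)
    return record

# The same chunk always yields the same record.
assert apply_chain(3, [0.5, 0.0]) == Customer(id=3, age=48, segment="consumer")
```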

First Steps
-----------

If you want to learn more about using Myriad, please read the [Getting Started Guide](/TU-Berlin-DIMA/myriad-toolkit/wiki/GettingStarted) and the [Quick Tour](/TU-Berlin-DIMA/myriad-toolkit/wiki/QuickTour).

If you want to learn more about the *Myriad Toolkit*, please read the [Getting Started Guide](/TU-Berlin-DIMA/myriad-toolkit/wiki/GettingStarted) and the [Quick Tour](/TU-Berlin-DIMA/myriad-toolkit/wiki/QuickTour).

For a running demo of a simple generator, please check out the [myriad-demo](https://github.com/TU-Berlin-DIMA/myriad-demo) package.

For a running demo of a simple generator, please check out the [vldb-demo](https://github.com/TU-Berlin-DIMA/vldb-demo) package.
<!--
|
|
<!--
|
|
The core runtime components provided by the toolkit are written in C++ on top of the [POCO C++ libraries](http://pocoproject.org/).