Welcome to the myriad wiki!

*Myriad* is a development toolkit for scalable parallel data generators. Generating large synthetic data sets according to a predefined schema and a set of statistical restrictions is a challenging yet increasingly important task, especially for benchmarking and testing systems that manage and process web-scale data, such as [Hadoop](http://hadoop.apache.org) or parallel RDBMSs like [DB2](http://www-01.ibm.com/software/data/db2/). Myriad aims to ease this process by giving you a fast and easy way to create your own data generators. All generators created with the toolkit are scale-out ready and support parallelization on shared-nothing architectures.

Core Features
=============

The main functional advantage of using the toolkit is the built-in parallelization support of the generators it produces. Our parallelization approach builds on the idea of mapping fixed-size chunks from an underlying pseudo-random number generator (PRNG) into a pseudo-random stream of records. The horizontal partitioning parallel execution model implemented by the toolkit relies on efficient `skip-ahead` PRNG operations to advance each generator node to the starting position of its assigned record substream.
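
The partitioning idea can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not Myriad's actual implementation: the class `SkipAheadLcg`, the `Record` layout, the constant `kDrawsPerRecord`, and the function `generate_partition` are all hypothetical names, and a simple 64-bit linear congruential generator with logarithmic-time jump-ahead stands in for whatever PRNG the toolkit really uses.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy 64-bit linear congruential generator: s' = kMul * s + kInc (mod 2^64).
// Assumption: a stand-in PRNG chosen only because it is easy to skip ahead.
class SkipAheadLcg {
public:
    explicit SkipAheadLcg(std::uint64_t seed) : state_(seed) {}

    // Advance by one position and return the new state as the draw.
    std::uint64_t next() {
        state_ = kMul * state_ + kInc;
        return state_;
    }

    // Advance by n positions in O(log n) steps by repeatedly squaring the
    // affine map s -> mul * s + inc (unsigned overflow wraps mod 2^64).
    void skip(std::uint64_t n) {
        std::uint64_t acc_mul = 1, acc_inc = 0;        // identity map
        std::uint64_t cur_mul = kMul, cur_inc = kInc;  // map for 2^k steps
        while (n > 0) {
            if (n & 1) {
                acc_mul *= cur_mul;
                acc_inc = acc_inc * cur_mul + cur_inc;
            }
            cur_inc = (cur_mul + 1) * cur_inc;
            cur_mul *= cur_mul;
            n >>= 1;
        }
        state_ = acc_mul * state_ + acc_inc;
    }

private:
    static constexpr std::uint64_t kMul = 6364136223846793005ULL;
    static constexpr std::uint64_t kInc = 1442695040888963407ULL;
    std::uint64_t state_;
};

// Hypothetical record type with two random fields.
struct Record {
    std::uint64_t id;
    std::uint64_t x;
    std::uint64_t y;
};

// Every record consumes a fixed-size chunk of the PRNG sequence.
constexpr std::uint64_t kDrawsPerRecord = 2;

// Generate the slice of `total` records assigned to `node` out of `node_count`.
std::vector<Record> generate_partition(std::uint64_t seed, std::uint64_t node,
                                       std::uint64_t node_count,
                                       std::uint64_t total) {
    assert(node_count > 0 && node < node_count);
    const std::uint64_t begin = total * node / node_count;
    const std::uint64_t end = total * (node + 1) / node_count;

    SkipAheadLcg rng(seed);
    rng.skip(begin * kDrawsPerRecord);  // jump straight to this node's substream

    std::vector<Record> out;
    out.reserve(end - begin);
    for (std::uint64_t id = begin; id < end; ++id) {
        const std::uint64_t x = rng.next();
        const std::uint64_t y = rng.next();
        out.push_back(Record{id, x, y});
    }
    return out;
}
```

Because each record occupies exactly `kDrawsPerRecord` positions, concatenating the output of `generate_partition(seed, 0, 4, n)` through `generate_partition(seed, 3, 4, n)` yields the same records as a single-node run `generate_partition(seed, 0, 1, n)`, and no node needs to communicate with any other.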

Moreover, the same technique facilitates the efficient realization of a broad set of reference-based model restrictions, as the random values of each referenced record are completely dependent on (and thus easily re-computable from) its sequence number. You can generate a set of `A`-records and a referencing set of `B`-records simply by sampling arbitrary `a` values from the `A`-sequence for each `b`, regardless of how the data is currently partitioned. A restriction of the form `b.y := a.x` (used, for instance, to set a foreign key in `b`) can be implemented through local re-computation of the required value `a.x` based on the position of the `A`-sample.
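
A minimal sketch of such a reference restriction follows. It is not taken from the Myriad code base: `draw_at`, `make_a`, `make_b`, the record structs, and the seed constants are hypothetical, and a SplitMix64-style mixing function is swapped in as a position-addressable random source so the example stays short. Modulo bias is ignored for simplicity.

```cpp
#include <cassert>
#include <cstdint>

// Random value at `position` of the stream identified by `seed`.
// Assumption: a SplitMix64-style mixer stands in for a skip-ahead PRNG.
inline std::uint64_t draw_at(std::uint64_t seed, std::uint64_t position) {
    std::uint64_t z = seed + (position + 1) * 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Hypothetical record types: B references A through a_id and copies a.x.
struct ARecord {
    std::uint64_t id;
    std::uint64_t x;
};

struct BRecord {
    std::uint64_t id;
    std::uint64_t a_id;
    std::uint64_t y;
};

constexpr std::uint64_t kSeedA = 0xA;  // stream used for A-records
constexpr std::uint64_t kSeedB = 0xB;  // stream used for B-records

// An A-record is a pure function of its sequence number.
ARecord make_a(std::uint64_t position) {
    return ARecord{position, draw_at(kSeedA, position) % 1000};
}

// A B-record samples a position in the A-sequence and re-computes that
// A-record locally to satisfy the restriction b.y := a.x.
BRecord make_b(std::uint64_t position, std::uint64_t a_count) {
    assert(a_count > 0);
    const std::uint64_t a_pos = draw_at(kSeedB, position) % a_count;
    const ARecord a = make_a(a_pos);  // no lookup on another node needed
    return BRecord{position, a.id, a.x};
}
```

Whichever node produces `make_b(j, a_count)` obtains the same `a.x` as the node that emitted the referenced `A`-record, because both values derive only from the `A`-position.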

Extensible Architecture
=======================

The Myriad toolkit facilitates the development of custom data generators through its extensible object-oriented architecture. Creating a generator for a custom data type can be as easy as defining the type as a _bean object_-style record class and specifying the random generation logic for a single record instance through a set of reusable _record hydration_ components.
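
To make the division of labour concrete, here is a rough sketch of how a bean-style record and a chain of reusable hydration components might fit together. Every name in it (`Customer`, `Hydrator`, `RangeHydrator`, `ChoiceHydrator`, `HydratorChain`, `customer_chain`, `make_customer`) is invented for illustration and should not be read as the toolkit's real API; the inline generator is a toy random source.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Bean-style record: private fields with plain setters and getters.
class Customer {
public:
    void setId(std::uint64_t v) { id_ = v; }
    std::uint64_t id() const { return id_; }
    void setAge(int v) { age_ = v; }
    int age() const { return age_; }
    void setSegment(const std::string& v) { segment_ = v; }
    const std::string& segment() const { return segment_; }

private:
    std::uint64_t id_ = 0;
    int age_ = 0;
    std::string segment_;
};

using DrawFn = std::function<std::uint64_t()>;  // source of random draws

// A hydrator fills in one field of a record from the random stream.
template <typename RecordT>
class Hydrator {
public:
    virtual ~Hydrator() = default;
    virtual void hydrate(RecordT& record, const DrawFn& draw) const = 0;
};

// Sets an int field to a value in [min, max] (modulo bias ignored).
template <typename RecordT>
class RangeHydrator : public Hydrator<RecordT> {
public:
    using Setter = void (RecordT::*)(int);
    RangeHydrator(Setter setter, int min, int max)
        : setter_(setter), min_(min), max_(max) { assert(min <= max); }
    void hydrate(RecordT& record, const DrawFn& draw) const override {
        const std::uint64_t span = static_cast<std::uint64_t>(max_ - min_) + 1;
        (record.*setter_)(min_ + static_cast<int>(draw() % span));
    }

private:
    Setter setter_;
    int min_, max_;
};

// Sets a string field to one of a fixed list of choices.
template <typename RecordT>
class ChoiceHydrator : public Hydrator<RecordT> {
public:
    using Setter = void (RecordT::*)(const std::string&);
    ChoiceHydrator(Setter setter, std::vector<std::string> choices)
        : setter_(setter), choices_(std::move(choices)) { assert(!choices_.empty()); }
    void hydrate(RecordT& record, const DrawFn& draw) const override {
        (record.*setter_)(choices_[draw() % choices_.size()]);
    }

private:
    Setter setter_;
    std::vector<std::string> choices_;
};

template <typename RecordT>
using HydratorChain = std::vector<std::unique_ptr<Hydrator<RecordT>>>;

// Generation logic for one Customer, expressed as reusable components.
HydratorChain<Customer> customer_chain() {
    HydratorChain<Customer> chain;
    chain.push_back(std::make_unique<RangeHydrator<Customer>>(&Customer::setAge, 18, 90));
    chain.push_back(std::make_unique<ChoiceHydrator<Customer>>(
        &Customer::setSegment, std::vector<std::string>{"retail", "business"}));
    return chain;
}

// Build the record at `position`; the draws depend only on the position.
Customer make_customer(std::uint64_t position, const HydratorChain<Customer>& chain) {
    std::uint64_t state = position;  // toy per-record seed (assumption)
    DrawFn draw = [&state]() {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        return state >> 33;  // keep the better-mixed high bits
    };
    Customer record;
    record.setId(position);
    for (const auto& hydrator : chain) hydrator->hydrate(record, draw);
    return record;
}
```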

First Steps
===========

If you want to learn more about using Myriad, please read the [Getting Started Guide][GettingStarted] and the [Quick Tour][QuickTour].

To get a running demo of a simple generator, check out the [myriad-demo](https://github.com/TU-Berlin-DIMA/myriad-demo) package.

<!--
The core set of runtime components provided by the toolkit is written in C++ on top of the [POCO C++ libraries](http://pocoproject.org/).
-->