... | ... | @@ -14,16 +14,21 @@ Core Features |
|
|
The *Myriad Toolkit* consists of two main components:
|
|
|
|
|
|
* a generic C++ *runtime library* for scalable data generation, and
|
|
|
* a Python *prototype compiler* that generates library extensions from an XML prototype specification of a user-defined data generator.
|
|
|
* a Python *prototype compiler* that generates library extensions from a *prototype specification* of a user-defined data generator written in XML.
|
|
|
|
|
|
Using the XML specification language, Myriad users can define the *domain model* to be generated as a family of user-defined *domain types*, and the associated data generation logic as a corresponding family of *pseudo-random domain type generators (PRDGs)*.
|
|
|
Through the use of the compact [*XML specification language*](/TU-Berlin-DIMA/myriad-toolkit/wiki/XML-Specification-Reference-Manual), Myriad users can define the *domain model* to be generated as a family of user-defined *domain types*, and the associated data generation logic as a corresponding family of *pseudo-random domain type generators (PRDGs)*.
|
|
|
|
|
|
In essence, PRDGs are functions that transform a sequence of pseudo-random numbers into a sequence of pseudo-random domain type records. PRDGs are specified as chains of *setter functions*, each one responsible for the assignment of (random) values to one or more record fields. The *Myriad Toolkit* provides a range of built-in primitive setters that realize various statistical properties (e.g. single field value distributions or value dependencies between record fields).
|
|
|
In essence, PRDGs are functions that transform a sequence of pseudo-random numbers into a sequence of pseudo-random domain type records. PRDGs are specified as chains of *setter functions*, each one responsible for the assignment of a fixed-length substream of values to one or more record fields. The *Myriad Toolkit* provides a range of built-in primitive setters that realize various statistical properties (e.g. single field value distributions or value dependencies between record fields).
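
The idea of a PRDG as a chain of setter functions can be illustrated with a minimal Python sketch. It is not the Myriad API: the `Customer` type, the two setters, and the `SETTER_CHAIN` layout are hypothetical names chosen for illustration, and Python's `random.Random` stands in for the toolkit's own PRNG. Each setter consumes a fixed number of pseudo-random draws and assigns one record field; the second setter shows a value dependency on a field set earlier in the chain.

```python
import random
from dataclasses import dataclass


@dataclass
class Customer:
    """Hypothetical domain type with two fields."""
    age: int = 0
    segment: str = ""


def set_age(record, draws):
    # Single-field distribution: uniform age in [18, 80] from one draw in [0, 1).
    record.age = 18 + int(draws[0] * 63)


def set_segment(record, draws):
    # Value dependency: the segment distribution depends on the age field.
    if record.age < 30:
        record.segment = "student" if draws[0] < 0.6 else "regular"
    else:
        record.segment = "premium" if draws[0] < 0.2 else "regular"


# The chain: (setter, number of draws it consumes). Fixed lengths mean every
# record uses the same amount of the underlying pseudo-random sequence.
SETTER_CHAIN = [(set_age, 1), (set_segment, 1)]


def generate_record(seed):
    """Transform the pseudo-random numbers derived from `seed` into one record."""
    rng = random.Random(seed)  # stand-in PRNG, an assumption of this sketch
    record = Customer()
    for setter, n_draws in SETTER_CHAIN:
        setter(record, [rng.random() for _ in range(n_draws)])
    return record
```

Calling `generate_record` twice with the same seed yields the same record, which is the property the parallel execution support relies on.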
|
|
|
|
|
|
The *Myriad runtime library* transparently builds parallel execution support into all compiled data generators. To do so, the framework makes sure that the following two conditions always hold. First, each domain record is identified by a unique position (i.e. a concrete seed) in the generating pseudo-random number sequence. Second, the sequence of pseudo-random numbers is generated by a pseudo-random number generator (PRNG) function that supports arbitrary skips to any position in the sequence in constant time. These runtime-level decisions are critical for efficient parallelization, as they allow us to
|
|
|
The *Myriad runtime library* transparently builds parallel execution support into all compiled data generators. To do so, the framework makes sure that the following two conditions always hold:
|
|
|
|
|
|
* (A) partition the generated PRDG sequences across an arbitrary number of data generator nodes in a shared-nothing environment, and
|
|
|
* (B) use *function shipping* (i.e. re-compute) instead of *data shipping* (i.e. transfer over the network) to get the contents of a referenced record generated on a remote node.
|
|
|
1. Each domain record is identified by a unique position (i.e. a concrete seed) in the generating pseudo-random number sequence.
|
|
|
1. The sequence of pseudo-random numbers is generated by a pseudo-random number generator (PRNG) function that supports arbitrary skips to any position in the sequence in constant time.
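
To make the second condition concrete, here is a minimal sketch of a PRNG whose value at any position can be computed directly, so that a skip costs constant time. It uses a counter-based construction built on the SplitMix64 mixing steps; this is a technique swapped in for illustration, not a description of the PRNG that the Myriad runtime actually uses. The names `value_at` and `SkippablePRNG` are hypothetical.

```python
MASK64 = (1 << 64) - 1
GAMMA = 0x9E3779B97F4A7C15  # SplitMix64 increment


def value_at(seed, position):
    """Return the 64-bit value at `position` without computing earlier values."""
    z = (seed + (position + 1) * GAMMA) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


class SkippablePRNG:
    """Sequential view over the same sequence, with O(1) repositioning."""

    def __init__(self, seed, position=0):
        self.seed = seed & MASK64
        self.position = position

    def skip_to(self, position):
        # Constant time: only the counter changes, no intermediate values
        # are generated.
        self.position = position

    def next(self):
        value = value_at(self.seed, self.position)
        self.position += 1
        return value
```

Because a value depends only on the seed and the position, reading the sequence from the start and jumping straight to a position give identical results from that position onward.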
|
|
|
|
|
|
These runtime-level decisions are critical for efficient parallelization. More specifically, they allow for
|
|
|
|
|
|
* (A) partitioning the generated PRDG sequences across an arbitrary number of data generator nodes in a shared-nothing environment, and
|
|
|
* (B) the use of *function shipping* (i.e. re-compute) instead of *data shipping* (i.e. transfer over the network) to get the contents of a referenced record generated on a remote node.
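
Both consequences can be sketched in a few lines of Python. The sketch assumes two hypothetical domain types (customers and orders), seeds Python's `random.Random` with a string derived from the record position as a stand-in for a skippable PRNG, and simulates the nodes as plain function calls; none of these names come from the Myriad Toolkit itself.

```python
import random


def generate_customer(position, seed=42):
    """Record content depends only on (seed, position), so any node can compute it."""
    rng = random.Random(f"customer:{seed}:{position}")
    return {"id": position, "age": rng.randint(18, 80)}


def partition_range(total, nodes, node_id):
    """(A) Contiguous slice of record positions assigned to one node."""
    begin = total * node_id // nodes
    end = total * (node_id + 1) // nodes
    return range(begin, end)


def generate_order(position, total_customers, seed=42):
    rng = random.Random(f"order:{seed}:{position}")
    customer_id = rng.randrange(total_customers)
    # (B) Function shipping: re-compute the referenced customer locally
    # instead of fetching it from the node that generated it.
    customer = generate_customer(customer_id, seed)
    return {
        "id": position,
        "customer_id": customer_id,
        "customer_age": customer["age"],
    }


def run_node(node_id, nodes, total_orders, total_customers):
    """Generate only the orders that fall into this node's partition."""
    return [
        generate_order(p, total_customers)
        for p in partition_range(total_orders, nodes, node_id)
    ]
```

Concatenating the output of `run_node` over all nodes gives the same orders as a single-node run, and no node needs to exchange data with another to resolve a customer reference.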
|
|
|
|
|
|
|
|
|
First Steps
|
... | ... | |