# Dataset
Auto-generated documentation for musicalgestures._dataset module.
Dataset and Corpus classes for managing collections of media files.
`MgDataset` manages a collection of media files (video or audio) and
provides batch processing, train/test splitting, and metadata management,
following conventions from `librosa` and MNE-Python.

`MgCorpus` is a higher-level convenience wrapper that scans a directory
tree for media files and builds an `MgDataset` automatically.
#### Examples

```python
>>> from musicalgestures._dataset import MgDataset
>>> ds = MgDataset.from_directory("/path/to/videos", pattern="*.avi")
>>> train, test = ds.train_test_split(test_size=0.2)
>>> for item in train:
...     print(item["path"], item["label"])
```
## MediaItem
[[find in source code]](https://github.com/fourMs/MGT-python/blob/master/musicalgestures/_dataset.py#L34)
```python
@dataclass
class MediaItem():
```

A single item in an `MgDataset`.

#### Parameters

- `path`: Absolute path to the media file.
- `label`: Optional class label or annotation string.
- `metadata`: Optional free-form metadata dict.
### MediaItem().is_audio

True if this is a recognised audio file.

### MediaItem().is_video

True if this is a recognised video file.

### MediaItem().stem

Filename without extension.

### MediaItem().suffix

File extension (lower-case).
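
The four properties above can be illustrated with a small stand-in class built on `pathlib`. This is a minimal sketch, not the library's implementation: `SketchItem`, `AUDIO_SUFFIXES`, and `VIDEO_SUFFIXES` are hypothetical names, the extension sets are assumptions, and whether the real `suffix` includes the leading dot is also assumed here.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional

# Assumed extension sets, chosen for illustration; the library's lists may differ.
AUDIO_SUFFIXES = {".wav", ".mp3", ".flac", ".ogg"}
VIDEO_SUFFIXES = {".avi", ".mp4", ".mov", ".mkv"}


@dataclass
class SketchItem:
    """Hypothetical stand-in for MediaItem (not the library class)."""

    path: Path
    label: Optional[str] = None
    metadata: dict = field(default_factory=dict)

    @property
    def stem(self) -> str:
        # Filename without its extension.
        return self.path.stem

    @property
    def suffix(self) -> str:
        # Extension with the leading dot, lower-cased (dot is an assumption).
        return self.path.suffix.lower()

    @property
    def is_audio(self) -> bool:
        return self.suffix in AUDIO_SUFFIXES

    @property
    def is_video(self) -> bool:
        return self.suffix in VIDEO_SUFFIXES


item = SketchItem(Path("/data/Dance1.AVI"), label="dance")
print(item.stem, item.suffix, item.is_video, item.is_audio)
```

Lower-casing the suffix before the membership test is what lets a file named `Dance1.AVI` still count as video in this sketch.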
## MgCorpus

```python
class MgCorpus(MgDataset):
    def __init__(
        root: str | Path,
        pattern: str = '**/*',
        label_from: str = 'parent',
    ) -> None:
```

Corpus: an `MgDataset` built by scanning a directory tree.

This is a convenience subclass. Use `MgDataset.from_directory` for equivalent functionality.
#### Parameters

- `root`: Root directory of the corpus.
- `pattern`: Glob pattern. Default: `'**/*'`.
- `label_from`: How labels are derived: `'parent'`, `'stem'`, or `'none'`. Default: `'parent'`.
#### Examples

```python
>>> corpus = MgCorpus("/data/recordings", label_from="parent")  # doctest: +SKIP
>>> len(corpus)  # doctest: +SKIP
120
>>> train, test = corpus.train_test_split(test_size=0.2)  # doctest: +SKIP
```

#### See also
- [MgDataset](#mgdataset)
## MgDataset
[[find in source code]](https://github.com/fourMs/MGT-python/blob/master/musicalgestures/_dataset.py#L77)
```python
class MgDataset():
    def __init__(
        items: list[MediaItem] | None = None,
        name: str = 'MgDataset',
    ) -> None:
```

A labelled collection of media files.

#### Parameters

- `items`: List of `MediaItem` objects.
- `name`: Optional human-readable name for this dataset.
#### Examples

```python
>>> from pathlib import Path
>>> from musicalgestures._dataset import MgDataset, MediaItem
>>> items = [
...     MediaItem(Path("/data/dance1.avi"), label="dance"),
...     MediaItem(Path("/data/piano1.avi"), label="piano"),
... ]
>>> ds = MgDataset(items, name="demo")
>>> len(ds)
2
```

### MgDataset().filter
[[find in source code]](https://github.com/fourMs/MGT-python/blob/master/musicalgestures/_dataset.py#L243)
```python
def filter(func) -> 'MgDataset':
```

Return a new dataset containing only items for which `func(item)` is True.

#### Parameters

- `func`: Callable accepting a `MediaItem` and returning `bool`.

#### Returns

`MgDataset`
### MgDataset().filter_by_label

Return a new dataset containing only items with the given label.

#### Parameters

- `label`: Label string to match.

#### Returns

`MgDataset`
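
The relationship between predicate filtering and label filtering can be sketched on plain lists. This is an illustration under stated assumptions, not the library's code: `SketchItem`, `filter_items`, and `filter_items_by_label` are hypothetical names, and the sketch treats label matching as a special case of predicate filtering, which the source does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchItem:
    """Hypothetical stand-in for MediaItem (not the library class)."""

    path: str
    label: Optional[str] = None


def filter_items(
    items: List[SketchItem], func: Callable[[SketchItem], bool]
) -> List[SketchItem]:
    # Build a new list; the input list is left unchanged.
    return [item for item in items if func(item)]


def filter_items_by_label(items: List[SketchItem], label: str) -> List[SketchItem]:
    # Label matching expressed as a predicate (an assumption of this sketch).
    return filter_items(items, lambda item: item.label == label)


items = [
    SketchItem("/data/dance1.avi", "dance"),
    SketchItem("/data/piano1.avi", "piano"),
    SketchItem("/data/dance2.avi", "dance"),
]
avi_only = filter_items(items, lambda item: item.path.endswith(".avi"))
dance_only = filter_items_by_label(items, "dance")
```

Both helpers return a new collection rather than mutating the input, mirroring the "return a new dataset" wording in the descriptions above.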
### MgDataset.from_directory

```python
@classmethod
def from_directory(
    directory: str | Path,
    pattern: str = '**/*',
    label_from: str = 'parent',
    recursive: bool = True,
    name: str | None = None,
) -> 'MgDataset':
```

Build a dataset by scanning a directory for media files.

#### Parameters

- `directory`: Root directory to scan.
- `pattern`: Glob pattern relative to `directory`. Default: `'**/*'`.
- `label_from`: How to derive labels: `'parent'` uses the immediate parent directory name; `'stem'` uses the filename stem; `'none'` assigns no label.
- `recursive`: If True (default), scan sub-directories.
- `name`: Optional dataset name.

#### Returns

`MgDataset`
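
The three `label_from` modes map naturally onto `pathlib` attributes. The helper below is a minimal sketch of that mapping, assuming the documented semantics; `derive_label` is a hypothetical name, and raising `ValueError` for an unknown mode is an assumption rather than documented behaviour.

```python
from pathlib import Path
from typing import Optional


def derive_label(path: Path, label_from: str = "parent") -> Optional[str]:
    """Hypothetical helper mirroring the documented label_from options."""
    if label_from == "parent":
        # Name of the immediate parent directory.
        return path.parent.name
    if label_from == "stem":
        # Filename without its extension.
        return path.stem
    if label_from == "none":
        # No label assigned.
        return None
    # Assumed error handling; the library may behave differently.
    raise ValueError(f"unknown label_from: {label_from!r}")


clip = Path("/data/recordings/dance/take1.avi")
print(derive_label(clip, "parent"), derive_label(clip, "stem"), derive_label(clip, "none"))
```

With `'parent'`, a layout of one sub-directory per class yields class labels directly from the directory names.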
### MgDataset.from_json

Load a dataset from a JSON file saved by `MgDataset().to_json`.

#### Parameters

- `path`: Path to the JSON file.

#### Returns

`MgDataset`
### MgDataset().labels

List of all item labels (in order).
### MgDataset().to_json

Serialise the dataset to JSON.

#### Parameters

- `path`: Optional file path to write. If None, returns the JSON string.

#### Returns

`str`
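
A save-and-reload round trip of the kind `to_json` and `from_json` provide can be sketched with the standard `json` module. Everything here is an assumption for illustration: the helper names `items_to_json` and `items_from_json`, the `{"name": ..., "items": [...]}` layout, and the choice to return the JSON string even when a path is given. The library's actual file format is not documented above.

```python
import json
import os
import tempfile
from typing import List, Optional


def items_to_json(
    items: List[dict], name: str = "MgDataset", path: Optional[str] = None
) -> str:
    # Assumed schema: a dataset name plus a list of item records.
    text = json.dumps({"name": name, "items": items}, indent=2)
    if path is not None:
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)
    return text


def items_from_json(path: str) -> dict:
    # Read back whatever items_to_json wrote.
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


records = [
    {"path": "/data/dance1.avi", "label": "dance", "metadata": {}},
    {"path": "/data/piano1.avi", "label": "piano", "metadata": {"take": 1}},
]

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "demo.json")
    text = items_to_json(records, name="demo", path=json_path)
    restored = items_from_json(json_path)
```

Storing paths as strings keeps the records JSON-serialisable; a `Path` object would need converting before `json.dumps`.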
### MgDataset().train_test_split

```python
def train_test_split(
    test_size: float = 0.2,
    shuffle: bool = True,
    seed: int | None = None,
) -> tuple['MgDataset', 'MgDataset']:
```

Split the dataset into train and test subsets.

#### Parameters

- `test_size`: Fraction of items to include in the test set. Default: 0.2.
- `shuffle`: Whether to shuffle before splitting. Default: True.
- `seed`: Random seed for reproducibility.

#### Returns

- `train`: `MgDataset`
- `test`: `MgDataset`
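
How `test_size`, `shuffle`, and `seed` interact can be sketched with the standard `random` module. This is an illustration, not the library's algorithm: `split_items` is a hypothetical name, and the rounding rule and the choice to take the test set from the end of the (optionally shuffled) order are assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple


def split_items(
    items: Sequence,
    test_size: float = 0.2,
    shuffle: bool = True,
    seed: Optional[int] = None,
) -> Tuple[List, List]:
    if not 0.0 <= test_size <= 1.0:
        raise ValueError("test_size must be between 0 and 1")
    order = list(items)
    if shuffle:
        # A private Random instance keeps the global random state untouched.
        random.Random(seed).shuffle(order)
    # Assumed rounding rule for the number of test items.
    n_test = round(len(order) * test_size)
    cut = len(order) - n_test
    return order[:cut], order[cut:]


paths = [f"clip{i}.avi" for i in range(10)]
train, test = split_items(paths, test_size=0.2, seed=42)
train_again, test_again = split_items(paths, test_size=0.2, seed=42)
```

Passing the same `seed` twice produces the same split in this sketch, which is the property the `seed` parameter is documented to provide.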
### MgDataset().unique_labels

Sorted list of unique non-None labels.
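
The difference between `labels` and `unique_labels` can be shown on plain data. This is a minimal sketch of the documented behaviour using `(path, label)` tuples as hypothetical stand-ins for `MediaItem` objects; it does not call the library.

```python
from typing import List, Optional, Tuple

# Hypothetical (path, label) pairs standing in for MediaItem objects.
items: List[Tuple[str, Optional[str]]] = [
    ("/data/piano1.avi", "piano"),
    ("/data/dance1.avi", "dance"),
    ("/data/unlabelled.avi", None),
    ("/data/dance2.avi", "dance"),
]

# All labels, in item order, including None and duplicates.
labels = [label for _, label in items]

# Unique labels: drop None, deduplicate, then sort.
unique_labels = sorted({label for label in labels if label is not None})
```

Filtering out `None` before sorting matters in Python 3, where comparing `None` with a string raises `TypeError`.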