PyTorch Datasets

This page lists the supported datasets and their corresponding PyTorch Dataset classes. If you're more interested in the datasets than in the code, see this page.
LibriMix

class asteroid.data.LibriMix(csv_dir, task='sep_clean', sample_rate=16000, n_src=2, segment=3, return_id=False)

Bases: torch.utils.data.Dataset
Dataset class for LibriMix source separation tasks.
- Parameters
csv_dir (str) – The path to the metadata file.
task (str) – One of 'enh_single', 'enh_both', 'sep_clean' or 'sep_noisy':
'enh_single' for single-speaker speech enhancement.
'enh_both' for multi-speaker speech enhancement.
'sep_clean' for two-speaker clean source separation.
'sep_noisy' for two-speaker noisy source separation.
sample_rate (int) – The sample rate of the sources and mixtures.
n_src (int) – The number of sources in the mixture.
segment (int, optional) – The desired length of the sources and mixtures, in seconds.
- References
[1] “LibriMix: An Open-Source Dataset for Generalizable Speech Separation”, Cosentino et al. 2020.
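As a minimal sketch, assuming csv_dir points at metadata produced by the LibriMix recipe (the path below is a placeholder), the dataset can be wrapped in a regular DataLoader; the commented shapes are indicative:
>>> from torch.utils.data import DataLoader
>>> from asteroid.data import LibriMix
>>> # 'metadata/train-100' is a hypothetical path to the recipe's metadata.
>>> train_set = LibriMix(
>>>     csv_dir='metadata/train-100', task='sep_clean',
>>>     sample_rate=16000, n_src=2, segment=3,
>>> )
>>> train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
>>> mixture, sources = next(iter(train_loader))
>>> # mixture: (batch, time), sources: (batch, n_src, time)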
classmethod loaders_from_mini(batch_size=4, **kwargs)

Downloads MiniLibriMix and returns train and validation DataLoader.
- Parameters
batch_size (int) – Batch size of the DataLoader. This is the only DataLoader parameter exposed; for more control over the DataLoader, call mini_from_download and instantiate the DataLoader yourself.
**kwargs – Keyword arguments to pass to LibriMix, see __init__. The kwargs will be fed to both the training set and the validation set.
- Returns
train_loader, val_loader – training and validation DataLoader out of LibriMix Dataset.
- Examples
>>> from asteroid.data import LibriMix
>>> train_loader, val_loader = LibriMix.loaders_from_mini(
>>>     task='sep_clean', batch_size=4
>>> )
classmethod mini_from_download(**kwargs)

Downloads MiniLibriMix and returns train and validation Dataset. If you want to instantiate the Dataset yourself, call mini_download, which returns the path to the metadata files.
- Parameters
**kwargs – Keyword arguments to pass to LibriMix, see __init__. The kwargs will be fed to both the training set and the validation set.
- Returns
train_set, val_set – training and validation instances of LibriMix (data.Dataset).
- Examples
>>> from asteroid.data import LibriMix
>>> train_set, val_set = LibriMix.mini_from_download(task='sep_clean')
Wsj0mix

class asteroid.data.Wsj0mixDataset(json_dir, n_src=2, sample_rate=8000, segment=4.0)

Bases: torch.utils.data.Dataset
Dataset class for the wsj0-mix source separation dataset.
- Parameters
json_dir (str) – The path to the directory containing the json files.
sample_rate (int, optional) – The sampling rate of the wav files.
segment (float, optional) – Length of the segments used for training, in seconds. If None, use full utterances (e.g. for test).
n_src (int, optional) – Number of sources in the training targets.
- References
“Deep clustering: Discriminative embeddings for segmentation and separation”, Hershey et al. 2015.
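As a sketch, assuming json files generated by the wsj0-mix recipe (the directory below is a placeholder), passing segment=None loads full test utterances:
>>> from asteroid.data import Wsj0mixDataset
>>> test_set = Wsj0mixDataset('data/wav8k/min/tt', n_src=2, sample_rate=8000, segment=None)
>>> mixture, sources = test_set[0]  # full-length mixture and stacked sources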
WHAM!

class asteroid.data.WhamDataset(json_dir, task, sample_rate=8000, segment=4.0, nondefault_nsrc=None, normalize_audio=False)

Bases: torch.utils.data.Dataset
Dataset class for WHAM source separation and speech enhancement tasks.
- Parameters
json_dir (str) – The path to the directory containing the json files.
task (str) – One of 'enh_single', 'enh_both', 'sep_clean' or 'sep_noisy':
'enh_single' for single-speaker speech enhancement.
'enh_both' for multi-speaker speech enhancement.
'sep_clean' for two-speaker clean source separation.
'sep_noisy' for two-speaker noisy source separation.
sample_rate (int, optional) – The sampling rate of the wav files.
segment (float, optional) – Length of the segments used for training, in seconds. If None, use full utterances (e.g. for test).
nondefault_nsrc (int, optional) – Number of sources in the training targets. If None, defaults to one for enhancement tasks and two for separation tasks.
normalize_audio (bool) – If True then both sources and the mixture are normalized with the standard deviation of the mixture.
- References
“WHAM!: Extending Speech Separation to Noisy Environments”, Wichern et al. 2019.
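For instance, a single-speaker enhancement set might be instantiated as follows (the json directory is a placeholder; nondefault_nsrc is left at its task-dependent default):
>>> from asteroid.data import WhamDataset
>>> train_set = WhamDataset(
>>>     'data/wav8k/min/tr', task='enh_single',
>>>     sample_rate=8000, segment=4.0, normalize_audio=True,
>>> )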
WHAMR!

class asteroid.data.WhamRDataset(json_dir, task, sample_rate=8000, segment=4.0, nondefault_nsrc=None)

Bases: torch.utils.data.Dataset
Dataset class for WHAMR source separation and speech enhancement tasks.
- Parameters
json_dir (str) – The path to the directory containing the json files.
task (str) – One of 'sep_clean', 'sep_noisy', 'sep_reverb' or 'sep_reverb_noisy':
'sep_clean' for two-speaker clean (anechoic) source separation.
'sep_noisy' for two-speaker noisy (anechoic) source separation.
'sep_reverb' for two-speaker clean reverberant source separation.
'sep_reverb_noisy' for two-speaker noisy reverberant source separation.
sample_rate (int, optional) – The sampling rate of the wav files.
segment (float, optional) – Length of the segments used for training, in seconds. If None, use full utterances (e.g. for test).
nondefault_nsrc (int, optional) – Number of sources in the training targets. If None, defaults to one for enhancement tasks and two for separation tasks.
- References
“WHAMR!: Noisy and Reverberant Single-Channel Speech Separation”, Maciejewski et al. 2020.
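A sketch for the noisy reverberant condition, with a placeholder json directory:
>>> from asteroid.data import WhamRDataset
>>> train_set = WhamRDataset('data/wav8k/min/tr', task='sep_reverb_noisy', segment=4.0)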
SMS-WSJ

class asteroid.data.SmsWsjDataset(json_path, target, dset, sample_rate=8000, single_channel=True, segment=4.0, nondefault_nsrc=None, normalize_audio=False)

Bases: torch.utils.data.Dataset

Dataset class for SMS-WSJ source separation.
- Parameters
json_path (str) – The path to the sms_wsj json file.
target (str) – One of 'source', 'early' or 'image':
'source' for non-reverberant clean target signals.
'early' for early-reverberation target signals.
'image' for reverberant target signals.
dset (str) – One of 'train_si284' (train), 'cv_dev93' (validation) or 'test_eval92' (test).
sample_rate (int, optional) – The sampling rate of the wav files.
segment (float, optional) – Length of the segments used for training, in seconds. If None, use full utterances (e.g. for test).
single_channel (bool) – If False, all channels are used. If True, only a random channel is used during training and the first channel during test.
nondefault_nsrc (int, optional) – Number of sources in the training targets.
normalize_audio (bool) – If True then both sources and the mixture are normalized with the standard deviation of the mixture.
- References
“SMS-WSJ: Database, performance measures, and baseline recipe for multi-channel source separation and recognition”, Drude et al. 2019.
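A sketch, assuming a json file produced by the SMS-WSJ recipe (the path is a placeholder); 'early' targets are one common choice of training reference:
>>> from asteroid.data import SmsWsjDataset
>>> train_set = SmsWsjDataset(
>>>     json_path='sms_wsj.json', target='early', dset='train_si284',
>>>     sample_rate=8000, single_channel=True, segment=4.0,
>>> )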
KinectWSJMix

class asteroid.data.KinectWsjMixDataset(json_dir, n_src=2, sample_rate=16000, segment=4.0)

Bases: asteroid.data.wsj0_mix.Wsj0mixDataset
Dataset class for the KinectWSJ-mix source separation dataset.
- Parameters
json_dir (str) – The path to the directory containing the json files.
sample_rate (int, optional) – The sampling rate of the wav files.
segment (float, optional) – Length of the segments used for training, in seconds. If None, use full utterances (e.g. for test).
n_src (int, optional) – Number of sources in the training targets.
- References
“Analyzing the impact of speaker localization errors on speech separation for automatic speech recognition”, Sivasankaran et al. 2020.
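Since the class inherits from Wsj0mixDataset, instantiation follows the same pattern, here at the 16 kHz default (placeholder json directory):
>>> from asteroid.data import KinectWsjMixDataset
>>> train_set = KinectWsjMixDataset('data/tr', n_src=2, sample_rate=16000, segment=4.0)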
DNSDataset

class asteroid.data.DNSDataset(json_dir)

Bases: torch.utils.data.Dataset
Deep Noise Suppression (DNS) Challenge’s dataset.
- Parameters
json_dir (str) – The path to the JSON directory (from the recipe).
- References
“The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results”, Reddy et al. 2020.
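Instantiation only needs the recipe's JSON directory (the path below is a placeholder):
>>> from asteroid.data import DNSDataset
>>> dns_set = DNSDataset('data/dns_json')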
MUSDB18

class asteroid.data.MUSDB18Dataset(root, sources=['vocals', 'bass', 'drums', 'other'], targets=None, suffix='.wav', split='train', subset=None, segment=None, samples_per_track=1, random_segments=False, random_track_mix=False, source_augmentations=<function MUSDB18Dataset.<lambda>>, sample_rate=44100)

Bases: torch.utils.data.Dataset
MUSDB18 music separation dataset.

The dataset consists of 150 full-length music tracks (about 10 hours in total) of different genres along with their isolated stems: drums, bass, vocals and other.

Out of the box, Asteroid only supports MUSDB18-HQ, which comes as uncompressed WAV files. To use MUSDB18, please convert it to WAV first:

MUSDB18-HQ: https://zenodo.org/record/3338373

Note

The datasets are hosted on Zenodo and require that users request access, since the tracks can only be used for academic purposes. These requests are checked manually.

This dataset assumes music tracks in (sub)folders, where each folder has a fixed number of sources (defaults to 4). For each track, a list of sources and a common suffix can be specified. A linear mix is performed on the fly by summing up the sources.

Because all tracks comprise the exact same set of sources, random track mixing can be used, where sources from different tracks are mixed together.
- Folder Structure:
>>> #train/1/vocals.wav ---------|
>>> #train/1/drums.wav ----------+--> input (mix), output[target]
>>> #train/1/bass.wav -----------|
>>> #train/1/other.wav ---------/
- Parameters
root (str) – Root path of dataset.
sources (list of str, optional) – List of source names that compose the mixture. Defaults to the MUSDB18 4-stem scenario: vocals, drums, bass, other.
targets (list or None, optional) – List of source names to be used as targets. If None, a dict with the 4 stems is returned. If e.g. [vocals, drums], a tensor with stacked vocals and drums is returned instead of a dict. Defaults to None.
suffix (str, optional) – Filename suffix, defaults to .wav.
split (str, optional) – Dataset subfolder, defaults to train.
subset (list of str, optional) – Selects a specific list of tracks to be loaded, defaults to None (loads all tracks).
segment (float, optional) – Duration of segments in seconds, defaults to None, which loads the full-length audio tracks.
samples_per_track (int, optional) – Number of samples yielded from each track; can be used to increase dataset size. Defaults to 1.
random_segments (boolean, optional) – Enables random offset for track segments.
random_track_mix (boolean, optional) – Enables mixing of random sources from different tracks to assemble the mix.
source_augmentations (list of callable) – List of augmentation functions, defaults to a no-op augmentation (input = output).
sample_rate (int, optional) – Sample rate of files in the dataset.
- Variables
root (str) – Root path of dataset.
sources (list of str, optional) – List of source names. Defaults to the MUSDB18 4-stem scenario: vocals, drums, bass, other.
suffix (str, optional) – Filename suffix, defaults to .wav.
split (str, optional) – Dataset subfolder, defaults to train.
subset (list of str, optional) – Selects a specific list of tracks to be loaded, defaults to None (loads all tracks).
segment (float, optional) – Duration of segments in seconds, defaults to None, which loads the full-length audio tracks.
samples_per_track (int, optional) – Number of samples yielded from each track; can be used to increase dataset size. Defaults to 1.
random_segments (boolean, optional) – Enables random offset for track segments.
random_track_mix (boolean, optional) – Enables mixing of random sources from different tracks to assemble the mix.
source_augmentations (list of callable) – List of augmentation functions, defaults to a no-op augmentation (input = output).
sample_rate (int, optional) – Sample rate of files in the dataset.
tracks (list of dict) – List of track metadata.
- References
“The 2018 Signal Separation Evaluation Campaign”, Stöter et al. 2018.
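A sketch of a training setup (root is a placeholder for the MUSDB18-HQ root folder): passing targets=['vocals', 'drums'] returns a stacked target tensor instead of a dict, while random_segments and random_track_mix act as on-the-fly augmentation.
>>> from asteroid.data import MUSDB18Dataset
>>> train_set = MUSDB18Dataset(
>>>     root='MUSDB18-HQ', targets=['vocals', 'drums'], split='train',
>>>     segment=5.0, random_segments=True, random_track_mix=True,
>>>     samples_per_track=16, sample_rate=44100,
>>> )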
DAMP-VSEP

class asteroid.data.DAMPVSEPSinglesDataset(root_path, task, split='train_singles', ex_per_track=1, random_segments=False, sample_rate=16000, segment=None, norm=None, source_augmentations=None, mixture='original')

Bases: torch.utils.data.Dataset
DAMP-VSEP vocal separation dataset
This dataset uses one of the two preprocessed versions of DAMP-VSEP from https://github.com/groadabike/DAMP-VSEP-Singles, aimed at single-singer separation.

The DAMP-VSEP dataset is hosted on Zenodo: https://zenodo.org/record/3553059
- Parameters
root_path (str) – Root path to the DAMP-VSEP dataset.
task (str) – One of 'enh_vocal' or 'separation':
'enh_vocal' for vocal enhancement.
'separation' for vocal and background separation.
split (str) – One of 'train_english', 'train_singles', 'valid' and 'test'. Defaults to 'train_singles'.
ex_per_track (int, optional) – Number of samples yielded from each track; can be used to increase dataset size. Defaults to 1.
random_segments (boolean, optional) – Enables random offset for track segments.
sample_rate (int, optional) – Sample rate of files in the dataset. Defaults to 16000 Hz.
segment (float, optional) – Duration of segments in seconds. Defaults to None, which loads the full-length audio tracks.
norm (str, optional) – Type of normalisation to use: 'song_level' to use the mixture's mean and std, or None for no normalisation. Defaults to None.
source_augmentations (callable, optional) – Augmentations applied to the sources (only). Defaults to None.
mixture (str, optional) – Whether to use the original mixture with non-linear effects or to remix the sources: 'remix' remixes the sources by addition, 'original' uses the original mixture. Defaults to 'original'.
Note
There are two train sets available:

train_english: uses all English-spoken songs, with duets converted into two singles. In total, 9243 performances and 77 hours.

train_singles: uses all singles performances, discarding all duets. In total, 20660 performances and 149 hours.
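A sketch for vocal/background separation on the singles split (root_path is a placeholder):
>>> from asteroid.data import DAMPVSEPSinglesDataset
>>> train_set = DAMPVSEPSinglesDataset(
>>>     root_path='DAMP-VSEP', task='separation', split='train_singles',
>>>     sample_rate=16000, segment=3.0, random_segments=True,
>>>     norm='song_level', mixture='original',
>>> )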
FUSS

class asteroid.data.FUSSDataset(file_list_path, return_bg=False)

Bases: torch.utils.data.Dataset
Dataset class for FUSS [1] tasks.
- Parameters
file_list_path (str) – Path to the file list created by the recipe.
return_bg (bool) – Whether to return the background along with the mixture and sources. Defaults to False.
- References
[1] “What’s All the FUSS About Free Universal Sound Separation Data?”, Wisdom et al. 2020, in preparation.
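A sketch, assuming a file list created by the FUSS recipe (the filename is a placeholder); set return_bg=True to also get the background:
>>> from asteroid.data import FUSSDataset
>>> train_set = FUSSDataset('train_list.txt', return_bg=False)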
AVSpeech

class asteroid.data.AVSpeechDataset(input_df_path: Union[str, pathlib.Path], embed_dir: Union[str, pathlib.Path], n_src=2)

Bases: torch.utils.data.Dataset
Audio Visual Speech Separation dataset as described in [1].
- Parameters
input_df_path (str or Path) – Path to the input dataframe.
embed_dir (str or Path) – Path to the directory where embeddings are stored.
n_src (int) – Number of sources. Defaults to 2.
- References
[1] “Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation”, Ephrat et al. https://arxiv.org/abs/1804.03619
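A sketch with placeholder paths for the input dataframe and the directory of precomputed embeddings:
>>> from asteroid.data import AVSpeechDataset
>>> train_set = AVSpeechDataset('train.csv', embed_dir='embeddings/', n_src=2)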