torchwrench.extras.hdf.pack module

torchwrench.extras.hdf.pack.hdf_dtype_to_fill_value(hdf_dtype: dtype | 'b' | 'i' | 'u' | 'f' | 'c' | type | None) → bool | int | float | complex | None | str | bytes

torchwrench.extras.hdf.pack.hdf_dtype_to_numpy_dtype(hdf_dtype: dtype | 'b' | 'i' | 'u' | 'f' | 'c' | type) → dtype

torchwrench.extras.hdf.pack.numpy_dtype_to_hdf_dtype(dtype: dtype | None, *, encoding: str = 'utf-8') → dtype

torchwrench.extras.hdf.pack.pack_to_hdf(dataset: SupportsGetitemLen[T_DictOrTuple, Any] | SupportsIterLen[T_DictOrTuple] | Mapping[str, SupportsGetitemLen], hdf_fpath: str | Path, pre_transform: Callable[[T_DictOrTuple], T_DictOrTuple] | None = <function identity>, *, batch_size: int = 32, num_workers: int | Literal['auto'] = 'auto', skip_scan: bool = False, encoding: str = 'utf-8', file_kwds: Dict[str, Any] | None = None, col_kwds: Dict[str, Any] | None = None, shape_suffix: str = '__shape', store_str_as_vlen: bool = False, user_attrs: Any = None, exists: Literal['overwrite', 'skip', 'error'] = 'error', ds_kwds: Dict[str, Any] | None = None, verbose: int = 0) → HDFDataset[T_DictOrTuple, T_DictOrTuple]

Pack a dataset into an HDF file.

Args:

dataset: The sized dataset to pack. All items must be dicts whose keys are strings and whose values are int, float, str, Tensor, non-empty List[int], non-empty List[float] or non-empty List[str]. If values are tensors or lists, the number of dimensions must be the same for all items in the dataset.
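The requirement that list values have the same number of dimensions across items can be checked before packing. The following is a minimal sketch for plain nested lists only; `nested_ndim` and `check_same_ndim` are hypothetical helpers written for illustration and are not part of torchwrench.

```python
from typing import Any, Dict, List, Set


def nested_ndim(value: Any) -> int:
    """Return the nesting depth of a non-empty nested list (scalars and strings have depth 0)."""
    ndim = 0
    while isinstance(value, list):
        if not value:
            raise ValueError("empty lists are not supported")
        value = value[0]  # only the first element is inspected at each level
        ndim += 1
    return ndim


def check_same_ndim(items: List[Dict[str, Any]]) -> Dict[str, int]:
    """Return the ndim of each column; raise ValueError if a column mixes several ndims."""
    ndims: Dict[str, Set[int]] = {}
    for item in items:
        for key, value in item.items():
            ndims.setdefault(key, set()).add(nested_ndim(value))
    mixed = {key: sorted(found) for key, found in ndims.items() if len(found) > 1}
    if mixed:
        raise ValueError(f"columns with inconsistent number of dimensions: {mixed}")
    return {key: next(iter(found)) for key, found in ndims.items()}
```

Note that lists of different lengths (for example `[1, 2]` and `[3]`) still pass this check, since only the number of dimensions is compared.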

hdf_fpath: The path to the HDF file.

pre_transform: The optional transform to apply to each item returned by the dataset BEFORE storing it in the HDF file. Can be used for deterministic transforms like Resample, LogMelSpectrogram, etc. defaults to the identity function (see the signature); None is also accepted.

batch_size: The batch size of the dataloader. defaults to 32.

num_workers: The number of workers of the dataloader. If "auto", it will be set to len(os.sched_getaffinity(0)). defaults to "auto".
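The "auto" resolution described for num_workers can be sketched as follows. `resolve_num_workers` is a hypothetical helper, not the library's own code, and the fallback to `os.cpu_count()` on platforms without `os.sched_getaffinity` is an assumption added here for portability.

```python
import os
from typing import Literal, Union


def resolve_num_workers(num_workers: Union[int, Literal["auto"]]) -> int:
    """Turn the "auto" setting into a concrete worker count."""
    if num_workers != "auto":
        return int(num_workers)
    if hasattr(os, "sched_getaffinity"):
        # Number of CPUs the current process is allowed to run on (not available on all platforms).
        return len(os.sched_getaffinity(0))
    # Assumption: fall back to the total CPU count where affinity is unavailable.
    return os.cpu_count() or 1
```

The affinity-based count can be smaller than the machine's total CPU count, for example inside a container or a job scheduler allocation.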

skip_scan: If True, the input dataset is treated as fully homogeneous, meaning that all values in a column share the same shape and dtype, which are inferred from the first batch. This skips the initial step that scans each dataset item once and speeds up packing to the HDF file. defaults to False.

encoding: String encoding used in the file. defaults to "utf-8".

file_kwds: Options given to the h5py.File object. defaults to None.

col_kwds: Options given to all dataset columns, i.e. to the h5py.File().create_dataset(.) method. defaults to None.

shape_suffix: Suffix of shape columns in the HDF file. defaults to "__shape".

store_str_as_vlen: If True, store strings with a variable-length string dtype. defaults to False.

user_attrs: Additional metadata to add to the HDF file. It must be convertible to JSON with json.dumps. defaults to None.
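Since user_attrs must be convertible with json.dumps, it can be validated before starting a long packing job. This is a small sketch using only the standard library; `is_json_convertible` is a hypothetical helper name and is not part of torchwrench.

```python
import json
from typing import Any


def is_json_convertible(user_attrs: Any) -> bool:
    """Return True if json.dumps accepts the given metadata."""
    try:
        json.dumps(user_attrs)
    except (TypeError, ValueError):
        # TypeError: unsupported type (e.g. set); ValueError: e.g. circular reference.
        return False
    return True
```

For example, a dict of strings, numbers and lists passes, while a dict containing a set does not.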

exists: Determines which action is performed if the target HDF file already exists.

"overwrite": Replace the target file, then pack the dataset.

"skip": Skip packing and return the dataset already packed in the target file.

"error": Raise a ValueError.
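The three policies for an existing target file can be summarized by the decision sketch below. `resolve_exists_action` and its "pack"/"load" return values are hypothetical and only illustrate the documented behavior; rejecting unknown policy values is an assumption, not something the documentation states.

```python
from pathlib import Path
from typing import Literal, Union


def resolve_exists_action(
    hdf_fpath: Union[str, Path],
    exists: Literal["overwrite", "skip", "error"] = "error",
) -> Literal["pack", "load"]:
    """Decide whether to pack the dataset or to load the file that is already there."""
    if exists not in ("overwrite", "skip", "error"):
        # Assumption: invalid policy names are rejected.
        raise ValueError(f"invalid exists policy: {exists!r}")
    path = Path(hdf_fpath)
    if not path.exists() or exists == "overwrite":
        return "pack"  # nothing to protect, or replacement was requested
    if exists == "skip":
        return "load"  # reuse the existing file instead of packing again
    raise ValueError(f"target file {path} already exists")
```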

ds_kwds: Keyword arguments passed to the returned HDFDataset instance if the target file already exists and if exists == "skip".

verbose: Verbose level. defaults to 0.

Returns:

hdf_dataset: The target HDF dataset object.
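A possible call, based only on the signature and argument descriptions above, is sketched below. The toy dataset, the helper names `make_toy_dataset` and `pack_toy_dataset`, and the chosen argument values are illustrative assumptions; the sketch has not been run against a specific torchwrench version and requires the library's optional HDF dependencies.

```python
from typing import Any, Dict, List


def make_toy_dataset(n: int = 4) -> List[Dict[str, Any]]:
    """Build a small sized dataset of dicts with string keys and int, list and str values."""
    return [
        {"index": i, "values": [float(i), float(i) + 0.5], "name": f"item_{i}"}
        for i in range(n)
    ]


def pack_toy_dataset(hdf_fpath: str):
    """Pack the toy dataset into hdf_fpath and return the resulting HDFDataset."""
    # Imported lazily so that building the toy dataset does not require torchwrench or h5py.
    from torchwrench.extras.hdf.pack import pack_to_hdf

    dataset = make_toy_dataset()
    return pack_to_hdf(
        dataset,
        hdf_fpath,
        batch_size=2,
        num_workers=0,  # explicit value instead of "auto"
        exists="overwrite",  # replace the file if it is already there
        user_attrs={"source": "toy example"},  # must be JSON-convertible
    )
```

Every item has the same keys and every "values" list has one dimension, which matches the constraints stated for the dataset argument.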