matgraphdb.materials.nodes.materials.MaterialStore

class MaterialStore(storage_path: str, initialize_kwargs: dict | None = None)

A store for material nodes. Inherits from NodeStore and persists its records as ParquetDB-managed Parquet files under storage_path.

__init__(storage_path: str, initialize_kwargs: dict | None = None)
Parameters:

storage_path (str) – The path where ParquetDB files for this node type are stored.
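
A minimal construction sketch. The directory path is hypothetical, and the idea that initialize_kwargs is forwarded to initialize() is inferred from the names rather than stated in this reference::

    from matgraphdb.materials.nodes.materials import MaterialStore

    # Hypothetical storage directory; the ParquetDB files for this node
    # type will be created under this path.
    store = MaterialStore(storage_path="data/nodes/materials")

    print(store.is_empty())   # expected to be True for a fresh store
    print(store.summary())    # formatted overview of the dataset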

Methods

__init__(storage_path[, initialize_kwargs])

Initializes the store at the given storage path.

backup_database(backup_path)

Creates a complete backup of the current dataset.

construct_table(data[, schema, metadata, ...])

Constructs a PyArrow Table from various input data formats.

copy_dataset(dest_name[, overwrite])

Creates a complete copy of the current dataset under a new name.

create(data[, schema, metadata, ...])

Adds new data to the database.

create_material([structure, coords, ...])

Adds a material to the database with optional symmetry and calculated properties.
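
A sketch of adding a single material, reusing the store from the construction example above. The assumption that structure accepts a pymatgen Structure is inferred from the package's materials focus, not stated in this table, and the remaining optional arguments hidden behind "..." are left out::

    from pymatgen.core import Lattice, Structure

    # Illustrative structure; whether ``structure`` takes a pymatgen
    # Structure is an assumption.
    si = Structure(Lattice.cubic(5.43), ["Si"], [[0.0, 0.0, 0.0]])
    store.create_material(structure=si)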

create_materials(materials[, schema, ...])

Adds multiple materials to the database in a single transaction.
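
A batch-insertion sketch, assuming that materials accepts a list of per-material dictionaries mirroring the keyword arguments of create_material(); the element format is an inference, not documented here::

    # ``si`` is the pymatgen Structure from the create_material() sketch.
    store.create_materials([
        {"structure": si},
        {"structure": si.copy()},
    ])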

create_nodes(data[, schema, metadata, ...])

Adds new data to the database.

dataset_exists([dataset_name])

Check if a dataset exists and contains data.

delete([ids, filters, columns, normalize_config])

Deletes records or columns from the database.
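
A deletion sketch. Integer ids, the "nelements" field, and the use of pyarrow.compute expressions for filters are all assumptions::

    import pyarrow.compute as pc

    # Remove specific records by id.
    store.delete(ids=[0, 1, 2])

    # Remove records matching a filter (hypothetical "nelements" column).
    store.delete(filters=[pc.field("nelements") == 1])

    # Drop an entire column instead of rows.
    store.delete(columns=["old_feature"])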

delete_materials([ids, columns, ...])

Deletes records from the database by ID.

delete_nodes([ids, columns, normalize_config])

Deletes records from the database.

drop_dataset()

Removes the current dataset directory and reinitializes it with an empty table.

export_dataset(file_path[, format])

Exports the entire dataset to a single file in the specified format.

export_partitioned_dataset(export_dir, ...)

Exports the dataset to a partitioned format in the specified directory.

get_current_files()

Get a list of all Parquet files in the current dataset.

get_field_metadata([field_names, return_bytes])

Retrieves metadata for specified fields/columns in the dataset.

get_field_names([columns, include_cols])

Get the names of fields/columns in the dataset schema.

get_file_sizes([verbose])

Get the size of each file in the dataset in MB.

get_metadata([return_bytes])

Retrieves the metadata of the dataset table.

get_n_rows_per_row_group_per_file([as_dict])

Get the number of rows in each row group for each file.

get_number_of_row_groups_per_file()

Get the number of row groups in each Parquet file in the dataset.

get_number_of_rows_per_file()

Get the number of rows in each Parquet file in the dataset.

get_parquet_column_metadata_per_file([as_dict])

Get detailed metadata for each column in each row group in each file.

get_parquet_file_metadata_per_file([as_dict])

Get the metadata for each Parquet file in the dataset.

get_parquet_file_row_group_metadata_per_file([...])

Get detailed metadata for each row group in each Parquet file.

get_row_group_sizes_per_file([verbose])

Get the size of each row group for each file.

get_schema()

Get the PyArrow schema of the dataset.

get_serialized_metadata_size_per_file()

Get the serialized metadata size for each Parquet file in the dataset.

import_dataset(file_path[, format])

Imports data from a file into the dataset, supporting multiple file formats.
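
A round-trip sketch for export_dataset() and import_dataset(); the file name and the "parquet" format string are assumptions, since the supported formats are not listed in this table::

    # Write the whole dataset to a single file...
    store.export_dataset("materials_export.parquet", format="parquet")

    # ...and load it back in later.
    store.import_dataset("materials_export.parquet", format="parquet")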

initialize(**kwargs)

is_empty()

Check if the dataset is empty.

merge_datasets(source_tables, dest_table)

normalize([normalize_config])

Normalize the dataset by restructuring files for optimal performance.

normalize_nodes([normalize_config])

Normalize the dataset by restructuring files for consistent row distribution.

preprocess_table(table[, ...])

Preprocesses a PyArrow table by flattening nested structures and handling special field types.

process_data_with_python_objects(data[, ...])

Processes input data and handles Python object serialization.

read([ids, columns, filters, load_format, ...])

Reads data from the database with flexible filtering and formatting options.
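
A read sketch. The column names, the pyarrow.compute filter expressions, and the assumption that load_format="table" returns a pyarrow.Table are all inferred rather than documented here::

    import pyarrow.compute as pc

    # Read selected columns for a handful of ids.
    table = store.read(
        ids=[0, 1, 2],
        columns=["id", "nelements"],   # hypothetical column names
        load_format="table",
    )

    # Filtered read over the whole store.
    binaries = store.read(filters=[pc.field("nelements") == 2])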

read_materials([ids, columns, filters, ...])

Reads data from the MaterialStore.

read_nodes([ids, columns, filters, ...])

Reads data from the database.

rename_dataset(new_name[, remove_dest])

Renames the current dataset directory and all contained files.

rename_fields(name_map[, normalize_config])

Rename fields/columns in the dataset using a mapping dictionary.
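
A field-renaming sketch; both column names are hypothetical::

    # Maps existing field names to new ones.
    store.rename_fields({"band_gap": "electronic_band_gap"})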

restore_database(backup_path)

Restores the dataset from a previous backup.
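
A backup-and-restore sketch covering backup_database() and restore_database(); the backup directory is hypothetical::

    # Snapshot the current dataset...
    store.backup_database("backups/materials_snapshot")

    # ...and roll back to that snapshot later if needed.
    store.restore_database("backups/materials_snapshot")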

set_field_metadata(fields_metadata[, update])

Sets or updates metadata for specific fields/columns in the dataset.

set_metadata(metadata[, update])

Sets or updates the metadata of the dataset table.
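
A metadata sketch. The keys and values are illustrative, and the reading of update=True as "merge with existing metadata" is an assumption::

    # Attach dataset-level metadata.
    store.set_metadata({"source": "example_import", "version": "1"})

    # Add another key without discarding what is already there
    # (assuming update=True merges rather than replaces).
    store.set_metadata({"notes": "test run"}, update=True)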

sort_fields([normalize_config])

Sort the fields/columns of the dataset alphabetically by name.

summary([show_column_names])

Generate a formatted summary string containing database information and metadata.

to_nested([nested_dataset_dir, ...])

Converts the current dataset to a nested structure optimized for querying nested data.

transform(transform_callable[, new_db_path, ...])

Transform the entire dataset using a user-provided callable.
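
A transform sketch, assuming the callable receives and returns a pyarrow.Table; that contract and the "density" column are inferred from the PyArrow-based API rather than stated here::

    import pyarrow as pa
    import pyarrow.compute as pc

    def add_density_flag(table: pa.Table) -> pa.Table:
        # Derive a boolean column from a hypothetical "density" field.
        flag = pc.greater(table["density"], 5.0)
        return table.append_column("is_dense", flag)

    store.transform(add_density_flag)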

update(data[, schema, metadata, ...])

Updates existing records in the database by matching on specified key fields.
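
An update sketch. Matching on an "id" key field and accepting plain dictionaries are both assumptions inferred from the description above::

    # Each record carries the assumed key field plus the columns to overwrite.
    store.update([
        {"id": 0, "band_gap": 1.1},
        {"id": 1, "band_gap": 0.0},
    ])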

update_materials(data[, schema, metadata, ...])

Updates existing records in the database.

update_nodes(data[, schema, metadata, ...])

Updates existing records in the database.

update_schema([field_dict, schema, ...])

Updates the schema of the table in the dataset.
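
A schema-update sketch, assuming field_dict maps field names to PyArrow types (an inference from the parameter name); the column is hypothetical::

    import pyarrow as pa

    # Promote a hypothetical column to 64-bit float.
    store.update_schema(field_dict={"band_gap": pa.float64()})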

Attributes

basename_template

Get the template for Parquet file basenames.

columns

Get the column names in the database.

dataset_name

Get the dataset name.

db_path

Get the database path.

n_columns

Get the number of columns in the database.

n_features

n_files

Get the number of Parquet files in the database.

n_nodes

Get the number of nodes in the store.

n_row_groups_per_file

Get the number of row groups in each Parquet file.

n_rows

Get the total number of rows in the database.

n_rows_per_file

Get the number of rows in each Parquet file.

n_rows_per_row_group_per_file

Get the number of rows in each row group for each file.

name_column

node_metadata_keys

serialized_metadata_size_per_file

Get the size of serialized metadata for each file.

storage_path

Get the path where the ParquetDB files for this node type are stored.