Data Architecture: SpectraStore
SpectraStore is the single source of truth for all numerical spectral data in SPECTROview. Both the Spectra and Maps workspaces read from and write to the same store. Understanding this module is the most important step for any new contributor.
Source file: spectra_store.py
Design Philosophy
Before SpectraStore was introduced (versions before 26.24.1), SPECTROview stored each spectrum as an independent Python object. This created performance bottlenecks when working with large hyperspectral maps (thousands of spectra). The tensor-centric redesign stores all spectra for a logical "map" as contiguous NumPy arrays, enabling:
- Vectorized preprocessing — range crop and baseline subtraction applied to the entire N×M matrix in a single operation.
- Batch fitting — the VBF engine operates directly on the tensor without Python-level iteration.
- O(1) map access — retrieving a map's data is a single dictionary lookup, regardless of store size.
- Heterogeneous datasets — each map owns its own x-axis, so maps with different wavenumber ranges coexist cleanly.
Data Hierarchy
SpectraStore
└── dict: _maps { map_name → MapData }
    │
    ├── MapData ("spectrum_A")          ← Spectra workspace: single spectrum (N=1)
    │   ├── x0: float64[M]              raw wavenumber axis
    │   ├── Y0: float32[1, M]           raw intensity (1 spectrum)
    │   ├── x: float64[M_proc]          processed axis (after crop)
    │   ├── Y: float32[1, M_proc]       processed intensity (after crop + baseline)
    │   ├── coords: float64[1, 2]       always (0.0, 0.0) for point spectra
    │   ├── fnames: ["spectrum_A"]      unique identifier
    │   ├── fit_model: dict             peak model definition
    │   ├── peak_params: float64[1, K]  fitted parameters
    │   ├── fit_success: bool[1]
    │   └── ...
    │
    └── MapData ("wafer_300mm")         ← Maps workspace: hyperspectral map (N=2500)
        ├── x0: float64[M]
        ├── Y0: float32[2500, M]        raw intensity matrix
        ├── coords: float64[2500, 2]    (X, Y) stage positions
        ├── fnames: [str × 2500]        "wafer_300mm_(x, y)" per spectrum
        ├── peak_params: float64[2500, K]
        ├── fit_success: bool[2500]
        └── ...
[!IMPORTANT] Both a single spectrum and a full hyperspectral map use the same `MapData` structure; they differ only in N (the number of rows). This unified model is what enables the Spectra and Maps workspaces to share the same ViewModel base class (VMWorkspaceSpectra).
MapData: The Tensor Block
MapData is a dataclass that owns all heavy arrays for one logical dataset. Its fields are organized into functional groups:
Spectral Arrays (immutable raw + mutable processed)
| Field | Shape | dtype | Purpose |
|---|---|---|---|
| `x0` | `[M]` | float64 | Original wavenumber axis; never mutated |
| `Y0` | `[N, M]` | float32 | Original intensities; never mutated |
| `x` | `[M_proc]` | float64 | Working axis after crop/x-correction (or `None` → use `x0`) |
| `Y` | `[N, M_proc]` | float32 | Working intensities after all preprocessing (or `None` → use `Y0`) |
| `coords` | `[N, 2]` | float64 | Spatial (X, Y) positions; always (0, 0) for point spectra |
[!IMPORTANT] The dual-array design (`x0`/`Y0` vs `x`/`Y`) is the key to non-destructive preprocessing. `x0` and `Y0` are written once on file load and never touched again. All preprocessing produces `x` and `Y`. Calling `reinit_spectra()` simply sets `md.x = None; md.Y = None`, making the store fall back to the raw arrays.
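The fallback behaviour can be illustrated with a minimal sketch. It assumes a stripped-down stand-in for `MapData` that holds only the four spectral arrays; `MiniMapData` and `working_xy` are hypothetical names used for illustration and are not part of SPECTROview's API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import numpy as np


@dataclass
class MiniMapData:
    """Hypothetical stand-in for MapData: raw arrays plus optional working arrays."""
    x0: np.ndarray                   # [M] raw axis, written once on load
    Y0: np.ndarray                   # [N, M] raw intensities, written once on load
    x: Optional[np.ndarray] = None   # [M_proc] working axis, None until preprocessing runs
    Y: Optional[np.ndarray] = None   # [N, M_proc] working intensities


def working_xy(md: MiniMapData) -> Tuple[np.ndarray, np.ndarray]:
    """Return the processed arrays, falling back to the raw ones when unset."""
    x = md.x if md.x is not None else md.x0
    Y = md.Y if md.Y is not None else md.Y0
    return x, Y


md = MiniMapData(
    x0=np.linspace(100.0, 500.0, 5),
    Y0=np.ones((2, 5), dtype=np.float32),
)

# Simulate preprocessing: the working arrays are set, the raw ones are untouched.
md.x, md.Y = md.x0[1:4], md.Y0[:, 1:4] * 2.0
cropped_x, cropped_Y = working_xy(md)

# Simulate the reset described above: drop the working arrays.
md.x = None
md.Y = None
raw_x, raw_Y = working_xy(md)
```

Because the reset only clears two references, it costs nothing regardless of map size, and the raw data never needs to be reloaded from disk.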
Per-Spectrum Metadata
| Field | Length | Purpose |
|---|---|---|
| `fnames` | N | Unique string identifier for each spectrum row |
| `is_active` | N | bool; whether this spectrum is checked in the UI list |
| `colors` | N | Optional display color per spectrum |
| `labels` | N | Optional user-assigned display label |
Preprocessing State (shared across all N spectra in a map)
| Field | Type | Purpose |
|---|---|---|
| `baseline_config` | dict | Baseline algorithm and parameters (mode, points, coef, ...) |
| `is_baseline_subtracted` | bool or bool[N] | Whether Y has had the baseline removed |
| `range_min` / `range_max` | float | Current spectral crop boundaries |
| `xcorrection_value` | float | X-axis shift applied (cm⁻¹) |
| `intensity_norm_factor` | float | Multiplicative intensity normalization |
| `map_metadata` | dict | Acquisition metadata from WDF/SPC files |
Fit Results (filled after VBF engine runs)
| Field | Shape | dtype | Purpose |
|---|---|---|---|
| `peak_params` | `[N, K]` | float64 | Optimized parameter values |
| `fit_success` | `[N]` | bool | Convergence flag per spectrum |
| `fit_r2` | `[N]` | float64 | Coefficient of determination per spectrum |
| `param_names` | list[str] | n/a | Column labels for `peak_params` (e.g. `m01_x0`, `m01_fwhm`) |
| `fit_model` | dict | n/a | Peak model definition used for fitting |
Visualization Curves (derived, for rendering only)
| Field | Shape | Purpose |
|---|---|---|
| `Y_bestfit` | `[N, M_proc]` | Composite model curve (sum of all peaks + baseline) |
| `Y_baseline` | `[N, M_proc]` | Evaluated baseline curve |
| `Y_peaks` | list of `[N, M_proc]` | One array per peak (for individual peak rendering) |
MapInfo: Lightweight Descriptor
@dataclass
class MapInfo:
    name: str
    row_start: int  # always 0 (kept for API compatibility)
    row_end: int
    n_spectra: int
    n_wavenumbers: int
MapInfo is a minimal summary returned by SpectraStore.get_map_info(). It exists for API compatibility — older code that needed to know "where does this map start in the global array" still works without modification, even though the global array no longer exists. Prefer get_map_data() for direct access.
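As a rough illustration of how legacy row-range code can keep working, the sketch below builds a `MapInfo` for a single map. The `make_map_info` helper is hypothetical, the 1024-point axis is an arbitrary example value, and treating `row_end` as an exclusive bound equal to `n_spectra` is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class MapInfo:
    name: str
    row_start: int       # always 0 (kept for API compatibility)
    row_end: int
    n_spectra: int
    n_wavenumbers: int


def make_map_info(name: str, n_spectra: int, n_wavenumbers: int) -> MapInfo:
    """Hypothetical helper: summarize one map whose rows start at 0."""
    # Assumption: row_end is exclusive, so it equals n_spectra.
    return MapInfo(
        name=name,
        row_start=0,
        row_end=n_spectra,
        n_spectra=n_spectra,
        n_wavenumbers=n_wavenumbers,
    )


info = make_map_info("wafer_300mm", n_spectra=2500, n_wavenumbers=1024)

# Older code that iterates over a row range still sees every spectrum of the map.
legacy_rows = range(info.row_start, info.row_end)
```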
SpectraStore: The Container
SpectraStore is a thin dictionary-backed container. Its public API separates cleanly into:
Map Registration
store.add_map(name, x0, Y0, coords, fnames, ...) # register a new map
store.remove_map(name) # delete a map and free memory
store.reorder_maps(new_order) # reorder for list display
Data Access
md = store.get_map_data(name) # MapData — primary access
info = store.get_map_info(name) # MapInfo — lightweight summary
x, Y = store.get_xy_batch(name, indices) # (processed) arrays for subset
Fit Results
store.set_fit_results(name, indices, peak_params, success, r2, param_names, fit_model)
store.build_fit_results_df(name, map_type, peak_labels, only_converged) # → pd.DataFrame
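The two calls above suggest a write-by-index step followed by a tabular export. The sketch below shows one way that could work on plain NumPy arrays. `write_fit_rows` is a hypothetical helper, and the convention that unfitted rows hold NaN or False is an assumption, not confirmed behaviour of `set_fit_results()`.

```python
import numpy as np


def write_fit_rows(peak_params, fit_success, fit_r2, indices, params, success, r2):
    """Hypothetical helper: scatter results for the fitted rows into the per-map arrays."""
    idx = np.asarray(indices, dtype=np.intp)
    peak_params[idx] = params      # [len(idx), K] block written in one assignment
    fit_success[idx] = success
    fit_r2[idx] = r2


N, K = 5, 2
param_names = ["m01_x0", "m01_fwhm"]

# Assumption: rows that were never fitted hold NaN / False.
peak_params = np.full((N, K), np.nan, dtype=np.float64)
fit_success = np.zeros(N, dtype=bool)
fit_r2 = np.full(N, np.nan, dtype=np.float64)

write_fit_rows(
    peak_params, fit_success, fit_r2,
    indices=[1, 3],
    params=np.array([[520.5, 4.0], [521.0, 4.2]]),
    success=np.array([True, False]),
    r2=np.array([0.99, 0.42]),
)

# Keep only converged rows, in the spirit of the only_converged argument.
converged_rows = np.flatnonzero(fit_success)
table = {name: peak_params[converged_rows, k] for k, name in enumerate(param_names)}
```

A column-per-parameter dict like `table` is the shape a DataFrame constructor expects, which is presumably why `param_names` is stored alongside `peak_params`.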
Preprocessing
store.batch_preprocess(name, baseline_config, range_min, range_max) # vectorized
store.clear_preprocess(name) # reset to x0/Y0
Serialization
store.to_npz_dict(name) # heavy arrays → dict for NPZ
store.to_metadata_dict(name) # lightweight metadata → dict for JSON
SpectraStore.load_map_from_npz(...) # class method: restore from saved arrays
SpectrumProxy and BaselineProxy: View Bridge
SpectrumProxy and BaselineProxy are read/write proxy objects that present a single-spectrum interface without duplicating any tensor data.
class SpectrumProxy:
    md: MapData   # reference to the parent tensor block
    idx: int      # row index within the tensor
    fname: str    # unique identifier

    @property
    def label(self): ...          # reads md.labels[idx]
    @label.setter
    def label(self, value): ...   # writes md.labels[idx]

    @property
    def color(self): ...          # reads md.colors[idx]
    @color.setter
    def color(self, value): ...   # writes md.colors[idx]
The View never holds raw arrays. When VMWorkspaceSpectra._emit_selected_spectra() prepares the payload for VSpectraViewer, it packages a list of SpectrumProxy objects alongside the array slices. This means the View can call proxy.label = "My Label" and the change automatically lands in md.labels[idx] without any explicit callback.
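A runnable sketch of the write-through behaviour follows. It assumes a reduced stand-in for `MapData` (`MiniMapData`, hypothetical) that holds only per-spectrum metadata lists; the real proxy presumably exposes more fields, such as the baseline.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MiniMapData:
    """Hypothetical stand-in for MapData holding only per-spectrum metadata."""
    fnames: List[str]
    labels: List[Optional[str]] = field(default_factory=list)
    colors: List[Optional[str]] = field(default_factory=list)

    def __post_init__(self):
        # Assumption: metadata lists are padded to one entry per spectrum.
        n = len(self.fnames)
        self.labels = self.labels or [None] * n
        self.colors = self.colors or [None] * n


class SpectrumProxy:
    """Single-spectrum view: stores a reference and a row index, never a copy."""

    def __init__(self, md: MiniMapData, idx: int):
        self.md = md
        self.idx = idx
        self.fname = md.fnames[idx]

    @property
    def label(self):
        return self.md.labels[self.idx]

    @label.setter
    def label(self, value):
        self.md.labels[self.idx] = value

    @property
    def color(self):
        return self.md.colors[self.idx]

    @color.setter
    def color(self, value):
        self.md.colors[self.idx] = value


md = MiniMapData(fnames=["wafer_(0, 0)", "wafer_(0, 1)"])
proxy = SpectrumProxy(md, idx=1)
proxy.label = "My Label"   # lands in md.labels[1]
```

The proxy itself holds two small attributes plus a reference, so creating one per selected spectrum stays cheap even for large maps.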
Class Relationships
classDiagram
class SpectraStore {
-_maps: dict
+add_map()
+get_map_data() MapData
+get_map_info() MapInfo
+set_fit_results()
+build_fit_results_df() DataFrame
+batch_preprocess()
+to_npz_dict()
+load_map_from_npz()
}
class MapData {
+name: str
+x0: ndarray
+Y0: ndarray
+x: ndarray
+Y: ndarray
+coords: ndarray
+is_active: ndarray
+fnames: list
+baseline_config: dict
+fit_model: dict
+peak_params: ndarray
+Y_bestfit: ndarray
+has_fit_results() bool
+n_spectra: int
}
class MapInfo {
+name: str
+n_spectra: int
+n_wavenumbers: int
+row_start: int
+row_end: int
}
class SpectrumProxy {
+md: MapData
+idx: int
+fname: str
+label: str
+color: str
+baseline: BaselineProxy
}
class BaselineProxy {
+mode: str
+points: list
+is_subtracted: bool
}
SpectraStore "1" *-- "0..*" MapData : owns
MapData "1" <-- "0..*" SpectrumProxy : wraps row of
SpectrumProxy "1" *-- "1" BaselineProxy : owns
MapInfo ..> MapData : describes
Data Management Strategy
Storage
Each spectrum or map is stored as an independent MapData block in SpectraStore._maps. There is no global array or shared index. This eliminates range arithmetic and makes deletion O(1) (del _maps[name]).
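The storage model can be sketched with a plain dictionary. `MiniStore` below is a hypothetical reduction that keeps only registration, lookup and deletion, and it uses simple dicts in place of `MapData` blocks.

```python
from typing import Dict, List, Optional


class MiniStore:
    """Hypothetical dict-backed container: one independent block per map name."""

    def __init__(self):
        self._maps: Dict[str, dict] = {}

    def add_map(self, name: str, block: dict) -> None:
        self._maps[name] = block

    def get_map_data(self, name: str) -> Optional[dict]:
        # Single dictionary lookup; unknown (or differently cased) names give None.
        return self._maps.get(name)

    def remove_map(self, name: str) -> None:
        # No index ranges to shift: dropping the key is the whole deletion.
        self._maps.pop(name, None)

    @property
    def map_names(self) -> List[str]:
        return list(self._maps)


store = MiniStore()
store.add_map("spectrum_A", {"n_spectra": 1})
store.add_map("wafer_300mm", {"n_spectra": 2500})
store.remove_map("spectrum_A")
```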
For each MapData, the store holds two array pairs:
| Pair | Contents | Mutability |
|---|---|---|
| `x0` / `Y0` | Raw, file-loaded data | Read-only after registration |
| `x` / `Y` | Preprocessed working data | Overwritten by `batch_preprocess()`, deleted by `clear_preprocess()` |
When x or Y is None, the store transparently falls back to x0 / Y0 in all access methods (get_xy_batch, set_plot_data, etc.).
Retrieval
ViewModels always retrieve data through SpectraStore.get_map_data(name) which returns a direct reference (not a copy) to the MapData block. This means:
- Reads are zero-copy — no data is duplicated.
- Writes are immediate: any modification to `md.Y` is visible everywhere that holds a reference to the same `md`.
This is why the ViewModel calls self.store.get_map_data(fname) at the start of every operation, rather than caching md as an instance variable.
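The reference semantics described here are ordinary Python and NumPy behaviour and can be checked in isolation. The sketch uses a plain dict of arrays as a hypothetical stand-in for the store and its `MapData` blocks.

```python
import numpy as np

# Hypothetical stand-in for the store: a dict of array holders.
maps = {"wafer": {"Y": np.zeros((3, 4), dtype=np.float32)}}

md_a = maps["wafer"]   # what one consumer gets back
md_b = maps["wafer"]   # what another consumer gets back

# An in-place write through one reference...
md_a["Y"][0, :] = 1.0
seen_by_b = float(md_b["Y"][0].sum())   # ...is visible through the other.

# Rebinding Y to a new array is also visible, because both names point at the same block.
md_a["Y"] = md_a["Y"] * 2.0
same_block = md_b["Y"] is md_a["Y"]

# An explicit copy is the way to get an isolated snapshot.
snapshot = md_a["Y"].copy()
snapshot[:] = -1.0
```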
Updates and Synchronization
The ViewModel follows a consistent update pattern after every state change, built on two emitters:
- `_emit_selected_spectra()` re-reads all selected MapData blocks and emits a fresh payload to the viewer.
- `_emit_list_update()` re-reads all map names and emits status dicts for the spectra list coloring.
This pull-on-signal pattern ensures the View is always consistent with the model without needing explicit change notifications on every field mutation.
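A toy version of the mutate-then-emit flow is sketched below. `MiniViewModel`, `mark_fitted` and the `status` key are hypothetical, and plain callables stand in for the signals that feed the viewer and the list widget.

```python
class MiniViewModel:
    """Hypothetical ViewModel: mutate the model, then re-emit everything from it."""

    def __init__(self, maps, on_spectra, on_list):
        self.maps = maps              # stand-in for the store: name -> dict
        self.selected = []
        self.on_spectra = on_spectra  # stand-in for the signal feeding the viewer
        self.on_list = on_list        # stand-in for the signal feeding the list widget

    def _emit_selected_spectra(self):
        # Re-read the selected blocks at emit time; nothing is cached in between.
        self.on_spectra([self.maps[name] for name in self.selected])

    def _emit_list_update(self):
        self.on_list({name: md["status"] for name, md in self.maps.items()})

    def mark_fitted(self, name):
        self.maps[name]["status"] = "fitted"   # 1. mutate the model
        self._emit_selected_spectra()          # 2. refresh the viewer payload
        self._emit_list_update()               # 3. refresh the list statuses


viewer_payloads, list_payloads = [], []
vm = MiniViewModel(
    {"a": {"status": "raw"}, "b": {"status": "raw"}},
    on_spectra=viewer_payloads.append,
    on_list=list_payloads.append,
)
vm.selected = ["a"]
vm.mark_fitted("a")
```

The trade-off is that every change re-sends the full payload. That is acceptable here because the payload carries references rather than copies.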
Memory Management
- `Y0` is stored as `float32` to halve memory vs `float64`, with no perceptible precision loss for spectroscopy.
- Processed arrays `Y` are only allocated after the first preprocessing operation.
- `Y_bestfit`, `Y_baseline`, and `Y_peaks` are allocated lazily on first fit.
- Deleting a map with `store.remove_map(name)` removes the dict entry; Python garbage collection releases the arrays when no other references exist.
- Large baseline computation intermediates are not cached; they are recomputed on demand by `eval_baseline_batch()`.
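The memory saving from `float32` is easy to verify directly. In the sketch below, the 2500-spectrum count comes from the wafer example, while the 1000-point axis is an arbitrary illustrative value.

```python
import numpy as np

N, M = 2500, 1000   # M is an illustrative assumption

Y_f64 = np.zeros((N, M), dtype=np.float64)
Y_f32 = Y_f64.astype(np.float32)

mb_f64 = Y_f64.nbytes / 1e6   # megabytes for the float64 matrix
mb_f32 = Y_f32.nbytes / 1e6   # megabytes for the float32 matrix

# Relative spacing of float32 values: roughly 7 significant decimal digits.
rel_eps = float(np.finfo(np.float32).eps)
```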
Preprocessing Pipeline
All preprocessing is non-destructive. x0 and Y0 are always preserved.
- Raw Data: `x0` and `Y0` (immutable).
- Range Crop: Arrays are sliced between `md.range_min` and `md.range_max`.
- X-Correction: `md.xcorrection_value` is added to the wavenumber axis.
- Baseline Evaluation: `SpectraStore.batch_preprocess()` evaluates the baseline model for the entire tensor.
- Baseline Subtraction: The evaluated baseline is subtracted from the cropped intensities.
- Working Arrays: The final results are stored in `md.x` and `md.Y`.
- Downstream Usage: These working arrays are used for all visualization and fitting operations.
SpectraStore.batch_preprocess() applies range crop and baseline subtraction to the entire N×M matrix in a single vectorized call. For the Spectra workspace (N=1), this processes one spectrum. For the Maps workspace (N=2500), it processes the entire map tensor in one call.
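The pipeline steps can be sketched as a single vectorized function. This is not the code behind `batch_preprocess()`: the `preprocess` function is hypothetical, and the straight-line baseline through each spectrum's end points is a simplifying assumption standing in for whatever `baseline_config` selects.

```python
import numpy as np


def preprocess(x0, Y0, range_min, range_max, x_shift=0.0):
    """Hypothetical pipeline: crop, shift the axis, subtract a straight-line baseline."""
    # 1. Range crop: one boolean mask selects the same columns for every row.
    keep = (x0 >= range_min) & (x0 <= range_max)
    x = x0[keep]
    Y = Y0[:, keep]                      # boolean indexing returns a new array; Y0 is untouched

    # 2. X-correction: add the shift to the cropped axis.
    x = x + x_shift

    # 3. Baseline evaluation for all rows at once.
    #    Assumption: a line through each spectrum's first and last point.
    t = (x - x[0]) / (x[-1] - x[0])      # [M_proc], 0 at the left edge and 1 at the right
    baseline = Y[:, :1] + (Y[:, -1:] - Y[:, :1]) * t   # broadcasts to [N, M_proc]

    # 4. Baseline subtraction, stored back as float32 working intensities.
    return x, (Y - baseline).astype(np.float32)


x0 = np.linspace(0.0, 100.0, 11)                       # 0, 10, ..., 100
slopes = np.array([[1.0], [2.0]], dtype=np.float32)
Y0 = (slopes * x0).astype(np.float32)                  # two purely linear "spectra"
Y0_before = Y0.copy()

x, Y = preprocess(x0, Y0, range_min=20.0, range_max=80.0, x_shift=0.5)
```

Since the test spectra are pure lines, the subtracted result is flat at zero, which makes the sketch easy to sanity-check. No step loops over rows, so the same call handles N=1 and N=2500.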
Persistence
Modern Format (v2+, ZIP-backed)
Workspace files (.spectra / .maps) are ZIP archives:
| Entry | Content |
|---|---|
| `metadata.json` | Lightweight JSON: `format_version`, `store_meta` dict (fnames, colors, labels, baseline_config, fit_model, range bounds, xcorrection) |
| `arrays.npz` | NumPy compressed arrays: `store_{name}_x0`, `store_{name}_y0`, `store_{name}_coords`, `store_{name}_peak_params`, `store_{name}_Y_bestfit`, `store_{name}_Y_peak_{i}` |
| `dataframes.pkl` | Optional: pickled `df_fit_results` |
The separation of lightweight JSON from heavy binary arrays enables fast metadata inspection without loading gigabytes of spectral data.
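The layout can be reproduced with the standard library and NumPy. The sketch below is an assumption-laden illustration, not the WorkspaceIO code: the function names are hypothetical, the nesting inside `store_meta` is guessed, and an in-memory buffer replaces the workspace file.

```python
import io
import json
import zipfile

import numpy as np


def save_workspace(buffer, name, metadata, arrays):
    """Hypothetical writer: one JSON entry plus one NPZ entry inside a ZIP archive."""
    npz_bytes = io.BytesIO()
    np.savez_compressed(npz_bytes, **{f"store_{name}_{key}": arr for key, arr in arrays.items()})
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("metadata.json", json.dumps(metadata))
        zf.writestr("arrays.npz", npz_bytes.getvalue())


def peek_metadata(buffer):
    """Read only the JSON entry; the NPZ entry is never decompressed."""
    with zipfile.ZipFile(buffer) as zf:
        return json.loads(zf.read("metadata.json"))


def load_arrays(buffer):
    """Read the NPZ entry and return its arrays keyed by their stored names."""
    with zipfile.ZipFile(buffer) as zf:
        with np.load(io.BytesIO(zf.read("arrays.npz"))) as npz:
            return {key: npz[key] for key in npz.files}


archive = io.BytesIO()   # in-memory stand-in for a .maps file
save_workspace(
    archive,
    "wafer",
    metadata={"format_version": 2, "store_meta": {"wafer": {"fnames": ["wafer_(0, 0)"]}}},
    arrays={"x0": np.arange(4, dtype=np.float64), "y0": np.ones((1, 4), dtype=np.float32)},
)

meta = peek_metadata(archive)
arrays = load_arrays(archive)
```

`peek_metadata` touches only the small JSON entry, which is what makes it possible to list maps or check the format version before committing to a full load.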
Serialization API
# Save path: ViewModel → SpectraStore → WorkspaceIO
metadata = store.to_metadata_dict(name) # JSON-serializable dict
arrays = store.to_npz_dict(name) # heavy NumPy arrays
# Load path: WorkspaceIO → SpectraStore
SpectraStore.load_map_from_npz(store, name, arrays, metadata)
Troubleshooting
Preprocessing not applied to viewer
- The viewer receives md.x / md.Y from the payload. If batch_preprocess() was not called (e.g., no baseline config and no range), md.x and md.Y remain None and the viewer falls back to md.x0 / md.Y0. This is the expected non-destructive behavior.
Memory grows unboundedly with large maps
- md.Y_bestfit, md.Y_baseline, and md.Y_peaks are never freed until the map is deleted or reinit_spectra() is called (which resets them to None). For very large maps, call store.remove_map() for maps that are no longer needed.
get_map_data() returns None
- The map name is case-sensitive and must match the exact key used in add_map(). Check store.map_names for the registered keys.
Zero-copy semantics cause unintended side effects
- Because get_map_data() returns a direct reference, modifying md.Y anywhere affects all consumers. Always operate through the ViewModel's established mutate-then-signal pipeline to ensure UI consistency.