I am working with the meta-feature extractor package pymfe for complexity analysis.
On a small dataset this works without a problem, for example:
pip install -U pymfe
from sklearn.datasets import load_iris
from pymfe.mfe import MFE

data = load_iris()
X = data.data
y = data.target

extractor = MFE(features=["t1"], groups=["complexity"],
                summary=["min", "max", "mean", "sd"])
extractor.fit(X, y)
extractor.extract()
(['t1'], [0.12])
My dataset is large (32690, 80), and this computation gets killed for excessive memory usage, likely because the complexity measures build all-pairs distance structures whose memory grows quadratically with the number of samples. I work on Ubuntu 24.04 with 32 GB of RAM.
To reproduce the scenario:
# Generate the dataset
from sklearn.datasets import make_classification
from pymfe.mfe import MFE

X, y = make_classification(n_samples=20_000, n_features=80,
                           n_informative=60, n_classes=5, random_state=42)

extractor = MFE(features=["t1"], groups=["complexity"],
                summary=["min", "max", "mean", "sd"])
extractor.fit(X, y)
extractor.extract()
Killed
Question:
How do I split this task to compute on small partitions of the dataset and then combine the final results (e.g., by averaging)?
2 Answers
Managed to find a workaround. StratifiedKFold in the sklearn.model_selection module is ideal for your case: it creates multiple balanced partitions while preserving the class distribution in each split. Here is the code:
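A minimal sketch of this approach follows; the n_splits=10 value, using the test side of each split as one partition, and the NaN-ignoring average are choices of mine that you should adapt to your memory budget:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from pymfe.mfe import MFE

X, y = make_classification(n_samples=20_000, n_features=80,
                           n_informative=60, n_classes=5, random_state=42)

# n_splits=10 is an assumption; raise it until one partition fits in RAM.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

names, partial_values = None, []
# Use the small test side of each split as one partition, so the ten
# partitions are disjoint and together cover the whole dataset exactly once.
for _, part_idx in skf.split(X, y):
    extractor = MFE(features=["t1"], groups=["complexity"],
                    summary=["min", "max", "mean", "sd"])
    extractor.fit(X[part_idx], y[part_idx])
    names, values = extractor.extract()
    partial_values.append(values)

# Combine: average each summarized value across partitions, ignoring
# NaNs that an individual partition may produce.
combined = np.nanmean(np.array(partial_values, dtype=float), axis=0)
print(list(zip(names, combined)))

Keep in mind this yields an approximation: complexity measures such as t1 are not additive, and averaging per-partition summaries (especially min and max, which become means of partition extremes) is not the same as computing them on the full dataset.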