
I am working with the meta-feature extractor package pymfe for complexity analysis.
On a small dataset this is not a problem. For example:

pip install -U pymfe

from sklearn.datasets import make_classification
from sklearn.datasets import load_iris
from pymfe.mfe import MFE

data = load_iris()
X = data.data
y = data.target

extractor = MFE(features=["t1"], groups=["complexity"],
                summary=["min", "max", "mean", "sd"])
extractor.fit(X, y)
extractor.extract()
# (['t1'], [0.12])

My dataset is large (32690, 80) and this computation gets killed for excessive memory usage. I am on Ubuntu 24.04 with 32 GB of RAM.

To reproduce the scenario:

# Generate the dataset
X, y = make_classification(n_samples=20_000, n_features=80,
                           n_informative=60, n_classes=5, random_state=42)

extractor = MFE(features=["t1"], groups=["complexity"],
                summary=["min", "max", "mean", "sd"])
extractor.fit(X, y)
extractor.extract()
Killed

Question:

How do I split this task so it runs on small partitions of the dataset, and then combine the final results (by averaging)?

2 Answers


  1. Chosen as BEST ANSWER

    Managed to find a workaround.

    import numpy as np
    from pymfe.mfe import MFE

    # helper functions
    def split_dataset(X, y, n_splits):
        # split the data into n_splits roughly equal partitions
        split_X = np.array_split(X, n_splits)
        split_y = np.array_split(y, n_splits)
        return split_X, split_y

    def compute_meta_features(X, y):
        # extract the meta-features for a single partition
        extractor = MFE(features=["t1"], groups=["complexity"],
                        summary=["min", "max", "mean", "sd"])
        extractor.fit(X, y)
        return extractor.extract()
    
    def average_results(results):
        # average the summary values across all partitions
        features = results[0][0]
        summary_values = np.mean([result[1] for result in results], axis=0)
        return features, summary_values
    
    # Split dataset
    n_splits = 10  # ten splits
    split_X, split_y = split_dataset(X, y, n_splits)
    
    # compute meta-features for each partition
    results = [compute_meta_features(X_part, y_part) for X_part, y_part in zip(split_X, split_y)]
    
    # Combined results
    final_features, final_summary = average_results(results)
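
    For illustration, here is one way to inspect the combined output (a minimal sketch; `final_features` and `final_summary` are the names from the snippet above):

    # pair each summarized feature name with its averaged value
    for name, value in zip(final_features, final_summary):
        print(f"{name}: {value:.4f}")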
    
    

  2. StratifiedKFold in sklearn.model_selection is ideal for your case: it creates multiple balanced partitions while preserving the class distribution in each split.

    Here is the code:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold
    from pymfe.mfe import MFE
    
    # Generate the large dataset
    X, y = make_classification(n_samples=20000, n_features=80, n_informative=60, n_classes=5, random_state=42)
    
    # Function to process each partition and extract features
    def process_partition(X, y):
        extractor = MFE(features=["t1"], groups=["complexity"], summary=["min", "max", "mean", "sd"])
        extractor.fit(X, y)
        return extractor.extract()
    
    # Function to combine results from all partitions
    def combine_results(partition_results):
        combined = {}
        for features, values in partition_results:
            for feature, value in zip(features, values):
                combined.setdefault(feature, []).append(value)
        return {feature: np.mean(values) for feature, values in combined.items()}
    
    # Create stratified partitions using StratifiedKFold
    # (process each fold's held-out indices so every partition is only 1/n_splits of the rows)
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    partition_results = [process_partition(X[part_idx], y[part_idx]) for _, part_idx in skf.split(X, y)]
    
    # Combine the results from all partitions
    final_result = combine_results(partition_results)
    print(final_result)
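
    Note that each partition here is a single held-out fold, i.e. roughly 1/10 of the rows, which is what keeps the per-call memory small; averaging t1 across partitions approximates, rather than exactly reproduces, the value computed on the full dataset. As a quick usage sketch (assuming the `final_result` dictionary built above), the combined values can be printed per summarized feature:

    for feature, value in sorted(final_result.items()):
        print(f"{feature}: {value:.4f}")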
    