Sep 11, 2023

Splitting a Hyperspectral Image Dataset with a Fixed Number of Samples

Splitting and recombining the original dataset

Foreword
Code
- Splitting
- Combining
- Result
Application

Foreword

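More and more HSIC (hyperspectral image classification) methods now use datasets that are re-split and recombined from the original ones into a new .mat file. The combined file usually contains the original data plus the training and test labels produced by the split. A fixed split like this makes results reproducible from run to run, so other algorithms can be compared on the same split. Without further ado, here is the code.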

Code

Take the Indian Pines dataset as an example. Suppose I want to convert the original data into a format that meets the following requirements:

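The data has three parts (in a .mat file these are simply three key-value pairs): the first is the test-label matrix, the second is the training-label matrix, and the third is the original hyperspectral data.
For training, randomly draw $X$ samples from each land-cover class (excluding the class with value 0, i.e. background pixels are dropped); the test samples for that class are the remaining ones, i.e. the class size minus $X$. If a class has fewer than $X$ samples, split it in half instead, so that number of training samples = number of test samples = class size / 2 (with an odd class size, the test set gets the extra sample).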
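The counting rule above can be sketched as a small helper. This is only an illustration: the function name `per_class_counts` is hypothetical, and the class sizes in the loop are example inputs, not results.

```python
def per_class_counts(num_samples, x):
    """Return (n_train, n_test) for one class under the rule described above."""
    if num_samples < x:
        # Not enough samples for a full draw: split the class in half.
        n_train = num_samples // 2
    else:
        n_train = x
    return n_train, num_samples - n_train


# Example class sizes with X = 30 (illustrative inputs only).
for n in (20, 28, 46, 1428):
    print(n, per_class_counts(n, 30))
```

Note that a class with exactly $X$ samples would put all of them into training and leave nothing for testing, so it is worth checking the class sizes before choosing $X$.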

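So the requirement breaks down into three parts. The third part is just the original data and can be assigned directly. The first and second parts need a random split. The logic is straightforward: find the target indices, fill in the labels, and return the new arrays.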

Splitting

import numpy as np
import scipy.io as sio

X_IP_DATA = sio.loadmat('./Indian_pines_corrected.mat')['indian_pines_corrected'] # (145, 145, 200) original data cube
Y_IP_DATA = sio.loadmat('./Indian_pines_gt.mat')['indian_pines_gt'] # (145, 145) original ground-truth labels


# Split the label map: pass in the ground-truth labels and the number of training samples per class
def split_data(data, choose_per_samples):
    unique_classes = np.unique(data[data != 0])  # skip 0, the background
    train_data = np.zeros_like(data)
    test_data = np.zeros_like(data)

    for cls in unique_classes:
        indices = np.where(data == cls)  # (row indices, column indices) of this class
        num_samples = len(indices[0])

        # Take choose_per_samples training samples per class; if the class has
        # fewer than that, split it in half instead
        if num_samples < choose_per_samples:
            num_train_samples = num_samples // 2
        else:
            num_train_samples = choose_per_samples

        # Draw the training positions without replacement; the rest go to the test set
        train_indices = np.random.choice(num_samples, size=num_train_samples, replace=False)
        test_indices = np.setdiff1d(np.arange(num_samples), train_indices)

        train_data[indices[0][train_indices], indices[1][train_indices]] = cls
        test_data[indices[0][test_indices], indices[1][test_indices]] = cls

    return train_data, test_data


train_data, test_data = split_data(Y_IP_DATA, choose_per_samples=30) # 30 training samples per class here (change this to suit your needs)

print("Training label map shape:", train_data.shape)
print("Test label map shape:", test_data.shape)
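Since the draw above is random, rerunning the script produces a different split each time; the split only becomes fixed once it is saved. If you would rather be able to regenerate the same split, one option is to seed the random generator. The sketch below is a variant, not the code above: `split_labels_seeded`, its `seed` parameter, and the small synthetic label map are all assumptions made for illustration. It also checks that the two label maps are disjoint and together cover every labeled pixel.

```python
import numpy as np


def split_labels_seeded(labels, per_class, seed=0):
    """Seeded variant of the per-class split (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    train = np.zeros_like(labels)
    test = np.zeros_like(labels)
    for cls in np.unique(labels[labels != 0]):
        rows, cols = np.nonzero(labels == cls)
        n = rows.size
        # Same rule as above: per_class samples, or half the class if it is too small.
        n_train = per_class if n >= per_class else n // 2
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        train[rows[tr], cols[tr]] = cls
        test[rows[te], cols[te]] = cls
    return train, test


# Synthetic 6x6 label map (0 = background) standing in for the real ground truth.
toy = np.array([
    [0, 1, 1, 1, 2, 2],
    [0, 1, 1, 1, 2, 2],
    [0, 1, 1, 1, 0, 0],
    [3, 3, 0, 0, 0, 0],
    [3, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
], dtype=np.uint8)

tr_a, te_a = split_labels_seeded(toy, per_class=5, seed=42)
tr_b, te_b = split_labels_seeded(toy, per_class=5, seed=42)

# Same seed, same split.
assert np.array_equal(tr_a, tr_b) and np.array_equal(te_a, te_b)
# No pixel is in both sets, and together they reproduce the original labels.
assert not np.any((tr_a != 0) & (te_a != 0))
assert np.array_equal(tr_a + te_a, toy)

print("train pixels:", np.count_nonzero(tr_a), "test pixels:", np.count_nonzero(te_a))
```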

Combining

With the three parts ready, we can combine them into a brand-new .mat file:

indian_pines_30 = {'TE': test_data, 'TR': train_data, 'input': X_IP_DATA}
sio.savemat('test_ip_30_split.mat', indian_pines_30) # the file 'test_ip_30_split.mat' now appears in the current directory

# Load the file back to inspect it
read_data_2 = sio.loadmat('./test_ip_30_split.mat')
read_data_2
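Before relying on the saved file, it can help to confirm that the three keys survive a save/load round trip unchanged. The following is a minimal sketch under stated assumptions: the tiny arrays and the file name `toy_split.mat` are made up for the check and are written to a temporary directory rather than next to the real data.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio

# Tiny stand-ins for the test labels, training labels and data cube.
labels_te = np.array([[1, 0], [0, 2]], dtype=np.uint8)
labels_tr = np.array([[0, 1], [2, 0]], dtype=np.uint8)
cube = np.arange(2 * 2 * 3, dtype=np.uint16).reshape(2, 2, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_split.mat')
    sio.savemat(path, {'TE': labels_te, 'TR': labels_tr, 'input': cube})
    loaded = sio.loadmat(path)

# loadmat adds metadata keys such as '__header__'; ignore those.
keys = sorted(k for k in loaded if not k.startswith('__'))
for k in keys:
    print(k, loaded[k].shape, loaded[k].dtype)

assert np.array_equal(loaded['TE'], labels_te)
assert np.array_equal(loaded['TR'], labels_tr)
assert np.array_equal(loaded['input'], cube)
```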

Result

Below are the contents of test_ip_30_split.mat. It clearly contains three parts, TE, TR and input: the test-label map, the training-label map, and the original data.

{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Mon Sep 11 22:12:54 2023',
 '__version__': '1.0',
 '__globals__': [],
 'TE': array([[3, 3, 3, ..., 0, 0, 0],
        [3, 3, 3, ..., 0, 0, 0],
        [3, 3, 3, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'TR': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'input': array([[[3172, 4142, 4506, ..., 1057, 1020, 1020],
         [2580, 4266, 4502, ..., 1064, 1029, 1020],
         [3687, 4266, 4421, ..., 1061, 1030, 1016],
         ...,
         [2570, 3890, 4320, ..., 1042, 1021, 1015],
         [3170, 4130, 4320, ..., 1054, 1024, 1020],
         [3172, 3890, 4316, ..., 1043, 1034, 1016]],
 
        [[2576, 4388, 4334, ..., 1047, 1030, 1006],
         [2747, 4264, 4592, ..., 1055, 1039, 1015],
         [2750, 4268, 4423, ..., 1047, 1026, 1015],
         ...,
         [3859, 4512, 4605, ..., 1056, 1035, 1015],
         [3686, 4264, 4690, ..., 1051, 1012, 1020],
         [2744, 4268, 4597, ..., 1047, 1019, 1016]],
 
        [[2744, 4146, 4416, ..., 1055, 1029, 1025],
         [2576, 4389, 4416, ..., 1051, 1021, 1011],
         [2744, 4273, 4420, ..., 1068, 1033, 1010],
         ...,
         [2570, 4266, 4509, ..., 1051, 1025, 1010],
         [2576, 4262, 4496, ..., 1047, 1029, 1020],
         [2742, 4142, 4230, ..., 1042, 1025, 1011]],
 
        ...,
 
        [[3324, 3728, 4002, ..., 1003, 1004, 1004],
         [2983, 3604, 3829, ..., 1011, 1013, 1008],
         [2988, 3612, 3913, ..., 1012, 1001, 1004],
         ...,
         [2564, 4115, 4103, ..., 1003, 1005, 1013],
         [2730, 4111, 4103, ..., 1015, 1013, 1004],
         [3156, 3991, 4103, ..., 1017, 1014, 1000]],
 
        [[3161, 3731, 3834, ..., 1002, 1000, 1000],
         [2727, 3742, 4011, ...,  999,  991, 1003],
         [2988, 4114, 4011, ..., 1006, 1008, 1013],
         ...,
         [3156, 3858, 4016, ..., 1011, 1004, 1003],
         [3159, 3858, 4100, ..., 1016, 1000, 1000],
         [2561, 3866, 4003, ..., 1008, 1008, 1000]],
 
        [[2979, 3728, 3732, ..., 1006, 1004, 1000],
         [2977, 3728, 3741, ..., 1007, 1009,  990],
         [2814, 3728, 3914, ...,  999, 1009, 1003],
         ...,
         [3153, 3864, 4282, ..., 1003, 1008, 1000],
         [3155, 4104, 4106, ..., 1011, 1005, 1003],
         [3323, 3860, 4197, ..., 1007, 1004, 1000]]], dtype=uint16)}
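A file in this layout is typically consumed by using the TR and TE label maps as masks over the data cube. The sketch below shows one way to do that; `gather_samples` is a hypothetical helper, the toy arrays stand in for `input` and `TR`, and shifting the labels to start at 0 is an assumption that suits most classifiers rather than something the file format requires.

```python
import numpy as np


def gather_samples(cube, label_map):
    """Collect the spectra and labels of every non-background pixel in label_map."""
    rows, cols = np.nonzero(label_map)
    spectra = cube[rows, cols, :]                          # (n_samples, n_bands)
    labels = label_map[rows, cols].astype(np.int64) - 1    # classes 1..C -> 0..C-1
    return spectra, labels


# Toy stand-ins: a 2x3 image with 4 bands and a matching training-label map.
toy_cube = np.arange(2 * 3 * 4).reshape(2, 3, 4)
toy_tr = np.array([[1, 0, 2],
                   [0, 0, 1]], dtype=np.uint8)

x_train, y_train = gather_samples(toy_cube, toy_tr)
print(x_train.shape, y_train)
```

With the real file you would pass the `input` array as `cube` and the `TR` or `TE` array as `label_map`.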

Application

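SpectralFormer, for example, uses a similar kind of dataset split. For original datasets such as Indian Pines it applies a split-and-combine procedure like the one above, with some differences. For the classes with more samples, 50 training samples are taken per class, but those 50 are hand-picked challenging samples rather than randomly drawn as in the code above. For the classes with fewer samples, the training samples are hand-picked as well. So the code above is only an example and a record of my own demo; a more detailed split can be made as your situation requires.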

GitHub - danfenghong/IEEE_TGRS_SpectralFormer: Danfeng Hong, Zhu Han, Jing Yao, Lianru Gao, Bing Zhang, Antonio Plaza, Jocelyn Chanussot. Spectralformer: Rethinking hyperspectral image classification with transformers, IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2021



Issac Tan


Some rights reserved

Except where otherwise noted, content on this page is licensed under a Creative Commons Attribution-NonCommercial 4.0 International license.
