Kaggle Knowledge Point: Memory Optimization Methods

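This article is reposted from the WeChat official account Coggle數據科學 (Coggle Data Science); copyright belongs to the original author.

On Kaggle and in everyday work, the memory available to our code is always limited. So how do we get code to run within that limit? This article presents several approaches.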

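Memory usage statistics

Before optimizing memory, you can use the following function to measure how much memory the current process is using.

import os

import numpy as np
import psutil

def cpu_stats():
    # Resident memory (RSS) of the current Python process
    pid = os.getpid()
    py = psutil.Process(pid)
    memory_use = py.memory_info()[0] / 2. ** 30  # bytes -> GB
    return 'memory GB:' + str(np.round(memory_use, 2))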

For data loaded with pandas, the following calls show memory usage:

# Overall memory usage
df.info(memory_usage="deep")

# Per-column memory usage
df.memory_usage()
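
The per-column figure can understate string columns unless deep=True is passed. A minimal sketch of the difference, using a made-up DataFrame whose column names are purely illustrative:

```python
import pandas as pd

# Illustrative frame: one numeric column and one string column stored as object.
df = pd.DataFrame({
    "id": range(1000),
    "city": pd.Series(["Amsterdam", "Berlin", "Copenhagen", "Dublin"] * 250, dtype=object),
})

shallow = df.memory_usage()         # object columns: only the 8-byte pointers are counted
deep = df.memory_usage(deep=True)   # also counts the Python string objects themselves

print(shallow)
print(deep)
```

For numeric columns the two numbers agree; for object columns the deep figure is the one to trust.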

For a whole application, the filprofiler tool can be used to find peak memory usage.

https://github.com/pythonspeed/filprofiler
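
If installing a profiler is not an option, the standard library's tracemalloc gives a rough peak figure. This is a swapped-in technique, not filprofiler: it only sees allocations made through Python's memory allocators, and the helper name and workload below are hypothetical.

```python
import tracemalloc

def peak_python_memory_mb(func, *args, **kwargs):
    """Run func and return (result, peak MiB allocated by Python while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    finally:
        tracemalloc.stop()
    return result, peak / 2 ** 20

def build_list(n):
    # Hypothetical workload: a temporary list twice as long as the result.
    tmp = list(range(2 * n))
    return tmp[:n]

result, peak_mb = peak_python_memory_mb(build_list, 100_000)
print(f"peak: {peak_mb:.2f} MiB")
```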

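NumPy memory optimization

Converting data types

NumPy supports many data types, and their memory footprints differ greatly. A uint64 array takes four times as much memory as a uint16 array of the same shape.

>>> import numpy as np
>>> from numpy import ones
>>> int64arr = ones((1024, 1024), dtype=np.uint64)
>>> int64arr.nbytes
8388608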

>>> int16arr = ones((1024, 1024), dtype=np.uint16)
>>> int16arr.nbytes
2097152

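Choose the data type according to the range of values in the matrix. For integers, refer to the ranges of the common types below and pick the narrowest one that fits.

Type	Range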
int8	(-128 to 127)
int16	(-32768 to 32767)
int32	(-2147483648 to 2147483647)
int64	(-9223372036854775808 to 9223372036854775807)
uint8	(0 to 255)
uint16	(0 to 65535)
uint32	(0 to 4294967295)
uint64	(0 to 18446744073709551615)
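
The table lookup can be automated with np.iinfo. The helper below is a minimal sketch (its name is hypothetical) that tries unsigned types for non-negative data and signed types otherwise:

```python
import numpy as np

# Candidate dtypes, narrowest first.
UNSIGNED = (np.uint8, np.uint16, np.uint32, np.uint64)
SIGNED = (np.int8, np.int16, np.int32, np.int64)

def smallest_int_dtype(values):
    """Return the narrowest integer dtype whose range covers min..max of values."""
    lo, hi = int(np.min(values)), int(np.max(values))
    candidates = UNSIGNED if lo >= 0 else SIGNED
    for dtype in candidates:
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return np.dtype(dtype)
    raise ValueError("values do not fit in a 64-bit integer")

ages = np.array([0, 17, 42, 99], dtype=np.int64)
compact = ages.astype(smallest_int_dtype(ages))
print(compact.dtype, ages.nbytes, "->", compact.nbytes)  # uint8 32 -> 4
```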

For floating-point numbers, consider storing the data as float16, float32, or float64. See the NumPy documentation for the full list of supported data types.

https://numpy.org/devdocs/user/basics.types.html
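
Unlike integers, narrowing a float type trades precision for space. A short sketch using np.finfo to inspect each type, and the value 0.1 as an arbitrary example of rounding error:

```python
import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    # bits: storage size; precision: approximate decimal digits; max: largest finite value
    print(np.dtype(dtype).name, info.bits, info.precision, info.max)

# Narrowing is lossy: 0.1 is stored with a larger error in float16 than in float32.
x = np.float64(0.1)
err16 = abs(float(np.float16(x)) - float(x))
err32 = abs(float(np.float32(x)) - float(x))
print(err16, err32)
```

Whether float16 is acceptable depends on the feature; values above 65504 overflow to infinity in float16.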

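Using sparse matrices

If the data in a matrix is sparse, consider storing it as a sparse matrix. LightGBM (LGB) and XGBoost (XGB) both support training on sparse matrices.

>>> import sparse; import numpy as np
>>> arr = np.random.random((1024, 1024))
>>> arr[arr < 0.9] = 0
>>> sparse_arr = sparse.COO(arr)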

>>> arr.nbytes
8388608

>>> sparse_arr.nbytes
2514648
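
The same idea can be sketched with scipy.sparse, which is swapped in here for the sparse package used above; to my knowledge it is the sparse format LightGBM and XGBoost accept directly. The seed and the 0.9 threshold are arbitrary choices for illustration.

```python
import numpy as np
from scipy import sparse as sp

rng = np.random.default_rng(0)
arr = rng.random((1024, 1024))
arr[arr < 0.9] = 0            # roughly 90% of the entries become zero

csr = sp.csr_matrix(arr)

# A CSR matrix stores three arrays: non-zero values, their column indices, and row pointers.
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(arr.nbytes, csr_bytes, csr.nnz)
```

The saving only materializes when most entries are zero; for a dense matrix the index arrays make the sparse form larger.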

Pandas memory optimization

Reading in chunks

If the data file is very large, read it in batches and use chunksize to control the number of rows per batch.

df = pd.read_csv(path, chunksize=1000000)

for chunk in df:
    # process the data one chunk at a time
    pass
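
A minimal runnable sketch of chunked processing, assuming the goal is a running aggregate. The in-memory CSV text stands in for a large file on disk, and the column names are made up:

```python
import io
import pandas as pd

# Stand-in for a large CSV file; in practice this would be a path on disk.
csv_text = "a,b\n" + "\n".join(f"{i},{i % 3}" for i in range(10))

total = 0
rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Each chunk is an ordinary DataFrame with at most 4 rows.
    total += chunk["a"].sum()
    rows += len(chunk)

print(rows, total)  # 10 45
```

Only one chunk is held in memory at a time, so the pattern works for any statistic that can be accumulated incrementally.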

Reading only some of the columns

df = pd.read_csv(path, usecols=["a"])

Setting column types up front

df = pd.read_csv(path, dtype={"a":"int8"})
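
The two read_csv options can be combined. The sketch below compares a default read with a read that keeps two columns and narrows their types; the data and column names are invented for illustration.

```python
import io
import pandas as pd

# Stand-in for a CSV file: a small-integer column, a float column, and a string column.
csv_text = "a,b,c\n" + "\n".join(f"{i % 100},{i * 0.5},x{i}" for i in range(1000))

full = pd.read_csv(io.StringIO(csv_text))
slim = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["a", "b"],                    # column "c" is never parsed into memory
    dtype={"a": "int8", "b": "float32"},   # values 0..99 fit in int8
)

print(full.memory_usage(deep=True).sum(), slim.memory_usage(deep=True).sum())
print(slim.dtypes)
```

Setting dtype at read time avoids ever materializing the wider default columns, which matters when the default read itself would not fit in memory.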

Converting categorical columns to the category type

df['a'] = df['a'].astype('category')

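This is very effective for categorical columns and gives a large compression ratio. In addition, once a column has the category type, LightGBM can treat it as a categorical feature during training.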
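
To see where the saving comes from, here is a small sketch with a hypothetical low-cardinality string column. The category type stores one small integer code per row plus a lookup table of the distinct values:

```python
import pandas as pd

# Hypothetical column: 4 distinct strings repeated many times.
s = pd.Series(["red", "green", "blue", "yellow"] * 25_000, dtype=object)
cat = s.astype("category")

before = s.memory_usage(deep=True)
after = cat.memory_usage(deep=True)
print(before, after, round(before / after, 1))

# The values are unchanged; only the storage differs.
print(cat.cat.categories.tolist(), cat.cat.codes.dtype)
```

The benefit shrinks as the number of distinct values approaches the number of rows, so this is best reserved for columns with few unique values.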

Detecting types automatically and converting

import numpy as np

def reduce_mem_usage(props):
    start_mem_usg = props.memory_usage().sum() / 1024 ** 2
    print("Memory usage of properties dataframe is :", start_mem_usg, " MB")
    NAlist = []  # Keeps track of columns that have missing values filled in.
    for col in props.columns:
        if props[col].dtype != object:  # Exclude strings
            # Print current column type
            print("")
            print("Column: ", col)
            print("dtype before: ", props[col].dtype)
            # make variables for Int, max and min
            IsInt = False
            mx = props[col].max()
            mn = props[col].min()
            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(props[col]).all():
                NAlist.append(col)
                # Fill with a sentinel below the minimum and keep mn in sync,
                # so the range checks below also cover the sentinel
                mn = mn - 1
                props[col] = props[col].fillna(mn)
            # test if column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = (props[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True

            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mx < 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mx < 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64)
            # Make float datatypes 32 bit
            else:
                props[col] = props[col].astype(np.float32)
            # Print new column type
            print("dtype after: ", props[col].dtype)
            print("")
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024 ** 2
    print("Memory usage is: ", mem_usg, " MB")
    print("This is ", 100 * mem_usg / start_mem_usg, "% of the initial size")
    return props, NAlist

https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65
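
A shorter alternative, not the kernel's method, is to lean on pandas' built-in pd.to_numeric with its downcast argument. The sketch below (helper name hypothetical) does not fill missing values and does not turn whole-number floats into integers, so it is more conservative than the function above:

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Return a copy of df with each numeric column downcast where pandas allows it."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            kind = "unsigned" if out[col].min() >= 0 else "integer"
            out[col] = pd.to_numeric(out[col], downcast=kind)
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

df = pd.DataFrame({
    "count": np.arange(1000, dtype=np.int64),        # 0..999 fits uint16
    "delta": np.arange(-500, 500, dtype=np.int64),   # -500..499 fits int16
    "score": np.linspace(0.0, 1.0, 1000),            # float64 -> float32
    "name": ["x"] * 1000,                            # non-numeric, left alone
})
small = downcast_numeric(df)
print(small.dtypes)
```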

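Combining with numpy.memmap

numpy.memmap reserves space for an array on disk ahead of time without loading it into memory, and it can be written to repeatedly.

So you can process the data one column at a time, store each result on disk, and read the data back into memory once processing is complete.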

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56105

https://numpy.org/doc/stable/reference/generated/numpy.memmap.html
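
The column-by-column workflow can be sketched as follows. The file name, shape, and the per-column computation are placeholders; the point is the write, flush, and reopen sequence.

```python
import os
import tempfile
import numpy as np

n_rows, n_cols = 1000, 3
path = os.path.join(tempfile.mkdtemp(), "features.dat")   # hypothetical file name

# mode="w+" creates the file and reserves n_rows * n_cols * 4 bytes on disk.
out = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_rows, n_cols))
for j in range(n_cols):
    # Process one column at a time; only this column needs to be in memory.
    out[:, j] = np.arange(n_rows, dtype=np.float32) * (j + 1)
out.flush()   # make sure the data reaches the file
del out

# Later: reopen read-only. dtype and shape are not stored in the file and must be supplied again.
features = np.memmap(path, dtype=np.float32, mode="r", shape=(n_rows, n_cols))
print(features[10], os.path.getsize(path))
```

Because the file holds raw bytes only, the dtype and shape have to be recorded somewhere else, for example in a small sidecar file.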

Model memory optimization

XGBoost

You can store the dataset in libsvm format and train with the External Memory Version, or train from the command line.

https://xgboost.readthedocs.io/en/latest/tutorials/external_memory.html
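
For reference, a libsvm file is plain text with one row per line: a label followed by index:value pairs for the non-zero features. The writer below is a minimal sketch, not an XGBoost API; the function name and the index_base parameter are assumptions of this example.

```python
import io
import numpy as np

def write_libsvm(X, y, fh, index_base=1):
    """Write rows as '<label> <index>:<value> ...', skipping zero entries.

    index_base is an assumption of this sketch: libsvm files are traditionally
    1-based, but check which convention your training tool expects.
    """
    for row, label in zip(X, y):
        nz = np.flatnonzero(row)
        feats = " ".join(f"{j + index_base}:{float(row[j])!r}" for j in nz)
        fh.write(f"{float(label):g} {feats}".rstrip() + "\n")

X = np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 3.0]])
y = np.array([1, 0])
buf = io.StringIO()
write_libsvm(X, y, buf)
print(buf.getvalue())
```

In practice the file handle would be a file opened on disk, written row by row so the full matrix never has to exist in text form in memory.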

LightGBM

Letting LightGBM's own Dataset read the file for training works better than passing NumPy or Pandas data. Converting in-memory data to a Dataset also helps to some extent.

https://lightgbm.readthedocs.io/en/latest/Python-Intro.html

Set the histogram_pool_size parameter to control memory usage; you can also reduce the values of num_leaves and max_bin.

https://lightgbm.readthedocs.io/en/latest/FAQ.html?highlight=Multiple#when-running-lightgbm-on-a-large-dataset-my-computer-runs-out-of-ram
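
As a configuration sketch, the three parameters would sit in the params dictionary passed to lgb.train. The values here are illustrative only, not recommendations, and the objective is an arbitrary choice:

```python
# Illustrative values only; tune them for your data and memory budget.
params = {
    "objective": "binary",
    "num_leaves": 15,              # fewer leaves than the default of 31
    "max_bin": 63,                 # fewer histogram bins than the default of 255
    "histogram_pool_size": 1024,   # cap on the histogram cache, in MB
}
```

Lower num_leaves and max_bin can cost accuracy, so it is worth checking validation scores after each reduction.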

Deep learning models

If you use a deep learning model, consider using a DataLoader to read the data into memory in batches.
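
The underlying idea can be shown without any framework. The generator below is a simplified sketch (its name and parameters are made up) of what a data loader such as PyTorch's torch.utils.data.DataLoader does: it yields one batch at a time instead of materializing the whole dataset.

```python
import numpy as np

def iter_batches(X, y, batch_size, shuffle=True, seed=0):
    """Yield (X_batch, y_batch) pairs, copying only one batch into memory at a time."""
    order = np.arange(len(X))
    if shuffle:
        np.random.default_rng(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        # Sorted indices make reads from a disk-backed array more sequential.
        idx = np.sort(order[start:start + batch_size])
        yield np.asarray(X[idx]), np.asarray(y[idx])

X = np.arange(20, dtype=np.float32).reshape(10, 2)   # could equally be a np.memmap
y = np.arange(10)
batches = list(iter_batches(X, y, batch_size=4))
print([len(xb) for xb, _ in batches])  # [4, 4, 2]
```

With X backed by np.memmap, only the rows of the current batch are read from disk.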

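Summary

Inspect the columns and rows, and read only the data you need;
Check the data types and convert them to narrower ones;
Process the data in batches, or use the disk.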

