Replicating "Wavelet Denoised-ResNet CNN and LightGBM Method to Predict Forex Rate of Change", Part 1

The paper in question is this one.
https://arxiv.org/pdf/2102.04861.pdf

It was singled out in the following survey paper for its extremely small RMSE.
Zexin Hu, Yiqi Zhao and Matloob Khushi "A Survey of Forex and Stock Price Prediction Using Deep Learning" (2021)
https://arxiv.org/ftp/arxiv/papers/2103/2103.09750.pdf

It was clear that paper [71], [75] achieved the best performance using DNN model. Papers that had a RMSE smaller than 0.001,

Note: [71] is the paper covered in this post.

Looking at the paper itself, it reports extraordinary prediction accuracy.

On top of that, the code used for the research is published on GitHub.
GitHub - mkhushi/code-with-paper

Half suspicious and half hopeful, I decided to try it out.

Install it and get it running

venv was not installed, so I installed it first.

sudo apt install python3.10-venv

cd ~/git
git clone https://github.com/mkhushi/code-with-paper.git
cd ~/git/code-with-paper/Forex\ Price\ Prediction\ by\ Wavelet\ Denoised-ResNet\ CNN\ and\ LightGBM/
python3 -m venv venv
source venv/bin/activate
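
Whether the venv is really active can also be checked from inside Python. The following is a minimal sketch, assuming an environment created with `python3 -m venv` as above; the helper name `in_virtualenv` is made up for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("venv active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

If this prints `False`, the `pip install` commands that follow would go into the system Python instead of the venv.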

On WSL (after confirming that the venv virtual Python environment is active):

pip install pandas
pip install PyWavelets
pip install matplotlib
wget http://prdownloads.sourceforge.net/ta-lib/ta-lib-0.4.0-src.tar.gz
tar -zxvf ta-lib-0.4.0-src.tar.gz
cd ta-lib
./configure --prefix=/usr
make
sudo make install
pip install TA-Lib

code .

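This opens VS Code.
In VS Code,
open the pre_processing folder.
Copy DataProcess.py to create a test version.
I named it DataProcess_test.py.

This is only a test for now, so I want to limit the data to 1000 rows.
Insert the following at line 15:
# Experiment with 1000 rows
data = data[:1000]

Also change the output file name.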
outputpath="./data/1000L_ProcessedUSDJPY-5M-2004.1.1-2020.6.9.csv"
# outputpath="./data/ProcessedUSDJPY-5M-2004.1.1-2020.6.9.csv"

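Run with F5.
The DataFrame syntax is probably outdated, so a flood of warnings appears, but I ignored them for now.

Confirm that the following file has been created in the data folder:
"1000L_ProcessedUSDJPY-5M-2004.1.1-2020.6.9.csv"

Back in the WSL terminal (after confirming that the venv virtual Python environment is active):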

pip install opencv-python

For PyTorch, check the install instructions on the website.
PyTorch

nvcc -V

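I used this to check the CUDA version, but CUDA was not installed, so I want to install it first.

Some libraries do not support the 12.x series yet, so I chose 11.8 from the 11.x series.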
CUDA Toolkit 11.8 Downloads | NVIDIA Developer

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda-repo-wsl-ubuntu-11-8-local_11.8.0-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-11-8-local_11.8.0-1_amd64.deb

sudo cp /var/cuda-repo-wsl-ubuntu-11-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

nvcc -V

Hmm... it still did not work.

Or so I thought, but it seems the PATH simply was not set.

export PATH="/usr/local/cuda/bin:$PATH"
nvcc -V

While I was at it, I also added the PATH line to ~/.profile.
export PATH="/usr/local/cuda/bin:$PATH"
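
The same PATH check can be done from Python, which is handy inside a notebook. This is a small sketch using only the standard library; `find_nvcc` is an illustrative name, not something from the repository.

```python
import shutil
import subprocess


def find_nvcc():
    """Return the full path to nvcc if it is on PATH, otherwise None."""
    return shutil.which("nvcc")


if __name__ == "__main__":
    nvcc = find_nvcc()
    if nvcc is None:
        print("nvcc not found on PATH; is /usr/local/cuda/bin listed in PATH?")
    else:
        # Same as running `nvcc -V` in the terminal.
        result = subprocess.run([nvcc, "-V"], capture_output=True, text=True)
        print(result.stdout)
```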

Back to PyTorch

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
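
After installing, it is worth checking that this PyTorch build can actually see the GPU. A hedged sketch: `cuda_report` is a made-up helper, and it returns early if torch is not importable, so it can be run in any environment.

```python
def cuda_report():
    """Collect basic facts about the installed PyTorch build."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```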

Install scikit-learn as well.

pip install scikit-learn

Following the article below, I configured Jupyter so that it can use the venv, then started Jupyter.
"Using venv with jupyter notebook" (jupyter notebookでvenvを使う) #Python - Qiita

pip install ipykernel
ipython kernel install --user --name=venv
jupyter notebook

In Jupyter,
move to the Predicition folder, copy Prediction.ipynb, and create Prediction_test.ipynb.

Choose venv under Select Kernel and run.

The code in this part assumes the data is fetched from Google, so rewrite all of it.

# Code to download file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)



id = "Use the dataset from DataProcessing"
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('DeLabeledUSDJPY-5M-2019.1.1-2020.6.9.csv')
df_JPYUSDM5=pd.read_csv('DeLabeledUSDJPY-5M-2019.1.1-2020.6.9.csv')

Change it to this:

filepath="../data/1000L_ProcessedUSDJPY-5M-2004.1.1-2020.6.9.csv"
df_JPYUSDM5=pd.read_csv(filepath)

I do not really understand the following line either, so I commented it out.

    dataM1 = pd.DataFrame(df_M1)

Reduce the hyperparameters to lower the memory load and processing time.

# Hypterparameters
days=30
# BATCH_SIZE=64
BATCH_SIZE=4
LR=1e-3
# epochs=100
epochs=10

For some reason the hyperparameters are assigned in another place too, so change those as well.

# Hypterparameters
days=30
# BATCH_SIZE=32
BATCH_SIZE=4
LR=1e-3
# epochs=100
epochs=10

I ran it and got an error, in roughly the 10th cell.

(omitted)

---> 47 train_outputs,feature1 = Mynet(train_inputs.double())

(omitted)
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 1024, 1, 1])

Hmm, what is this?
It looks like some kind of data size problem.
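
This ValueError is the one PyTorch's batch normalization raises in training mode when a batch holds a single sample, and the leading 1 in `torch.Size([1, 1024, 1, 1])` is that batch dimension. One plausible cause, stated here as an assumption and not verified against the notebook, is that the number of training samples is not a multiple of BATCH_SIZE, so the final batch ends up with one sample. The sketch below uses 929 samples (the x2 row count that shows up in a later traceback in this post) purely as an illustrative figure; `batch_sizes` is a hypothetical helper.

```python
def batch_sizes(n_samples, batch_size, drop_last=False):
    """List the size of every batch produced from n_samples samples."""
    full_batches, remainder = divmod(n_samples, batch_size)
    sizes = [batch_size] * full_batches
    if remainder and not drop_last:
        sizes.append(remainder)  # the short final batch
    return sizes


if __name__ == "__main__":
    # 929 is an assumed sample count; 4 is the BATCH_SIZE set above.
    print("last batch, 929 samples:", batch_sizes(929, 4)[-1])
    print("last batch, 850 samples:", batch_sizes(850, 4)[-1])
    print("last batch, 929 samples, drop_last:", batch_sizes(929, 4, drop_last=True)[-1])
```

If the notebook feeds its batches through `torch.utils.data.DataLoader`, passing `drop_last=True` would discard the short final batch. Whether it does so was not checked here.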

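There is a place where the data is split into training and test sets, but the data has nowhere near 99000 rows, so I split it into the first 850 rows and the rest.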

# FinalM5_train=IndicatorValueM5[:99000]
FinalM5_train=IndicatorValueM5
FinalM5_test=IndicatorValueM5[99000:]

# labelM5_train=labelM5[:99000]
labelM5_train=labelM5
labelM5_test=labelM5[99000:]

↓

# FinalM5_train=IndicatorValueM5[:99000]
#FinalM5_train=IndicatorValueM5
#FinalM5_test=IndicatorValueM5[99000:]

FinalM5_train=IndicatorValueM5[:850]
FinalM5_test=IndicatorValueM5[850:]

# labelM5_train=labelM5[:99000]
#labelM5_train=labelM5
#labelM5_test=labelM5[99000:]

labelM5_train=labelM5[:850]
labelM5_test=labelM5[850:]
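
The hard-coded index 99000 silently produces an empty test set on smaller data, because slicing past the end of a sequence returns an empty result instead of raising. Below is a small sketch of a split helper that refuses to do that. `split_at` is a hypothetical name, and a plain list stands in for the notebook's arrays.

```python
def split_at(rows, n_train):
    """Split rows into (train, test) at index n_train, rejecting empty halves."""
    if not 0 < n_train < len(rows):
        raise ValueError(
            f"n_train={n_train} leaves an empty split for {len(rows)} rows"
        )
    return rows[:n_train], rows[n_train:]


if __name__ == "__main__":
    rows = list(range(1000))
    # Slicing beyond the end does not fail; it just returns nothing.
    print("rows[99000:] has", len(rows[99000:]), "elements")
    train, test = split_at(rows, 850)
    print("train:", len(train), "test:", len(test))
```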

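After running everything again from the start, it got further.

With so little data it finished in about two minutes, but it does use the GPU heavily.

Next, import lightgbm failed. This just needs a pip install.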

pip install LightGBM

Another error, in a different place.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 12
     10 x2=x2.reshape((-1,days*x2.shape[-1]))
     11 print(x2.shape)
---> 12 x=np.concatenate([x2, x1], 1)
     14 y=dataM5.label4[33:].to_numpy()
     15 y=y[days:-4]

ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 929 and the array at index 1 has size 850

The sizes along dimension 0 do not match.
x2 is the larger one, at 929 rows.
Adjust it using x1.shape[0].

import lightgbm as lgb
from sklearn.model_selection import train_test_split
# x=a.reshape((-1,days*a.shape[-1]))
x1=feature
print(x1.shape)
x2=dataM5[['RSI5','RSI10','RSI20','macd','macd','macdsignal','slowk','slowd','fastk','fastd','WR5','WR10', 'WR20','ROC5','ROC10','ROC20','CCI5','CCI10','CCI20','ATR5','ATR10','ATR20','NATR5','NATR10','NATR20','TRANGE']].to_numpy()
x2=x2[33:-4]
x2=x2[:x1.shape[0]+days]
x2=np.array([x2[i-days:i] for i in range(days,len(x2))])
x2=x2.reshape((-1,days*x2.shape[-1]))
print(x2.shape)
x=np.concatenate([x2, x1], 1)

y=dataM5.label4[33:].to_numpy()
y=y[days:-4]
y=y[:x1.shape[0]]
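
Why trimming x2 to `x1.shape[0]+days` rows works: the windowing step builds one window for each index from `days` up to `len(x2)-1`, so it yields `len(x2) - days` windows. The 929 windows in the traceback therefore correspond to 959 rows before windowing, and trimming to 850 + 30 rows gives exactly 850. Here is a plain-Python sketch of the same arithmetic, with `make_windows` as an illustrative name and dummy two-column rows instead of the real indicators.

```python
def make_windows(rows, days):
    """Flatten each run of `days` consecutive rows into one feature vector."""
    return [
        [value for row in rows[i - days:i] for value in row]
        for i in range(days, len(rows))
    ]


if __name__ == "__main__":
    days = 30
    rows = [[i, i * 10] for i in range(959)]  # dummy data with 2 columns
    print("untrimmed windows:", len(make_windows(rows, days)))
    trimmed = rows[:850 + days]  # same trim as x2[:x1.shape[0]+days]
    print("trimmed windows:", len(make_windows(trimmed, days)))
```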

LightGBM training then failed because of an argument mismatch.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[21], line 45
     42 trn_data = lgb.Dataset(x_train, y_train)
     43 val_data = lgb.Dataset(x_test, y_test)
---> 45 clf = lgb.train(params,
     46                 trn_data,
     47                 20000,
     48                 valid_sets=[trn_data, val_data],
     49                 verbose_eval=200,
     50                 early_stopping_rounds=500)
     51 oof = clf.predict(x_train, num_iteration=clf.best_iteration)
     52 predictions = clf.predict(x_test, num_iteration=clf.best_iteration)

TypeError: train() got an unexpected keyword argument 'verbose_eval'

early_stopping_rounds is no longer accepted either, so I simply removed both arguments.

clf = lgb.train(params,
                trn_data,
                20000,
                valid_sets=[trn_data, val_data]
                #verbose_eval=200,
                #early_stopping_rounds=500
)
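
Deleting the two arguments also drops early stopping, so training is asked to run all 20000 rounds. In recent LightGBM versions the same behaviour is available through callbacks: `lgb.log_evaluation` takes over from `verbose_eval` and `lgb.early_stopping` from `early_stopping_rounds`. A sketch of what that could look like, assuming `params`, `trn_data` and `val_data` are defined as in the notebook; the wrapper name `train_with_callbacks` is made up for illustration.

```python
def train_with_callbacks(params, trn_data, val_data, num_rounds=20000):
    """Train with periodic logging and early stopping via callbacks."""
    # Imported here so that defining this function does not require
    # lightgbm; it is only needed when the function is called.
    import lightgbm as lgb

    callbacks = [
        lgb.log_evaluation(period=200),            # was verbose_eval=200
        lgb.early_stopping(stopping_rounds=500),   # was early_stopping_rounds=500
    ]
    return lgb.train(
        params,
        trn_data,
        num_boost_round=num_rounds,
        valid_sets=[trn_data, val_data],
        callbacks=callbacks,
    )
```

With this in place, `clf = train_with_callbacks(params, trn_data, val_data)` would stand in for the `lgb.train(...)` call, and `clf.best_iteration` is set by the early-stopping callback when it triggers.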

The Google Drive code for saving the prediction data shows up again near the end, so delete it entirely.
# Save file to google drive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once in a notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# Create & upload a text file.
uploaded = drive.CreateFile({'title': 'D1_Predicted.csv'})
uploaded.SetContentFile('D1_Predicted.csv')
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))

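Check the output data, and that is it.
However, it has 996 rows for some reason, and I do not really understand why.
Overall, I have doubts about things like how the test data and training data are separated.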

Not that it matters much, but the column list used when loading the data contains the "macd" column twice.
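
Duplicates like this are easy to miss in a long list. A quick standard-library check, with the column list copied from the notebook cell shown earlier in this post; `find_duplicates` is an illustrative helper name.

```python
from collections import Counter

# Column list as it appears in the notebook cell quoted in this post.
columns = [
    'RSI5', 'RSI10', 'RSI20', 'macd', 'macd', 'macdsignal',
    'slowk', 'slowd', 'fastk', 'fastd', 'WR5', 'WR10', 'WR20',
    'ROC5', 'ROC10', 'ROC20', 'CCI5', 'CCI10', 'CCI20',
    'ATR5', 'ATR10', 'ATR20', 'NATR5', 'NATR10', 'NATR20', 'TRANGE',
]


def find_duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name, count in counts.items() if count > 1]


if __name__ == "__main__":
    print(find_duplicates(columns))
```

Selecting a column twice in pandas returns it twice, so the model receives the same macd values as two separate features.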

Pandas kept emitting annoying errors without Pyarrow, so I installed it just in case.