Applying tanh scaling in linear regression models
In this post I'll show how I use tanh scaling to reduce the error of linear regression models.
I tested this scaling in two projects:
- California housing prices, from Kaggle
- León housing rent prices, a dataset I built myself; here's the project.
In both cases I got better results than with a linear regression model using other scaling methods.
Scaling process
To scale, I use the following form of the tanh function:
\[FeatureScaled = \tanh\left(\frac{feature}{mean(feature)} \right)\]
The larger the denominator, the smoother the function becomes. Here the outliers actually help: they raise the mean value, which smooths the curve.
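To make the effect of the mean concrete, here is a minimal sketch of the formula applied to a small made-up feature, once with an outlier and once without. The helper name `tanh_scale` and the sample values are only for illustration and are not taken from the projects.

```python
import numpy as np

def tanh_scale(values):
    """Scale values with tanh(value / mean(values))."""
    values = np.asarray(values, dtype=float)
    return np.tanh(values / values.mean())

# Hypothetical right-skewed feature: mostly small values plus one outlier.
feature = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
scaled = tanh_scale(feature)

# Same feature without the outlier, for comparison.
scaled_no_outlier = tanh_scale(feature[:-1])

print("with outlier:   ", scaled)
print("without outlier:", scaled_no_outlier)
```

With the outlier the mean is larger, so the small values land on the flatter, almost linear part of the tanh curve, while the outlier itself is squashed to just under 1.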
First, take a look at the distribution of each dataset.
California Housing distribution
León Housing distribution
In both cases you can see that the distribution is concentrated on the left, around the mean, and has a long tail on the right.
Now I scale the data according to the scaling formula. This is all the process needs:
import numpy as np
# scale train data
for col in X_train_scaled.columns:
    X_train_scaled[col] = np.tanh(X_train_scaled[col] / np.mean(X_train_scaled[col]))
y_train_scaled = np.tanh(y_train_scaled / np.mean(y_train_scaled))
# scale test data
for col in X_test_scaled.columns:
    X_test_scaled[col] = np.tanh(X_test_scaled[col] / np.mean(X_test_scaled[col]))
y_test_scaled = np.tanh(y_test_scaled / np.mean(y_test_scaled))
# X values refer to the feature columns
# y values refer to the target column
I scale the train and test sets separately so that no information is exchanged between them.
For this problem I scaled both the feature data and the target data with the tanh function; with the other scalers I scaled only the feature data. This is how I got the best results.
After scaling both datasets:
California housing scaled
León housing scaled
The result is a distribution with values between 0 and 1. In the California dataset it looks like half of a normal distribution. In the León dataset it doesn't look like any known distribution shape, but the California dataset is much bigger than the León one, so... it's OK.
Linear Regression
Now, for the linear regression, I'll compare different methods.
# import the regression models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
# to compute the error
from sklearn.metrics import mean_squared_error
And to compare scalers, I'll use these methods:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PowerTransformer
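The comparison loop itself is not shown in the post, so here is a rough sketch of how these scalers could be compared with a plain `LinearRegression`. Everything in it is an assumption for illustration: the synthetic log-normal data, the `train_test_split` call, the dictionary names, and fitting each scaler on the training features before applying it to the test features. Only the features are scaled, as described above for the non-tanh scalers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (
    MinMaxScaler,
    PowerTransformer,
    RobustScaler,
    StandardScaler,
)

# Synthetic right-skewed data standing in for a housing dataset (assumption).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 3))
y = X @ np.array([2.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scalers = {
    "standard": StandardScaler(),
    "minmax": MinMaxScaler(),
    "robust": RobustScaler(),
    "power": PowerTransformer(),
}

rmse_by_scaler = {}
for name, scaler in scalers.items():
    # Fit the scaler on the training features, then apply it to both splits.
    X_train_s = scaler.fit_transform(X_train)
    X_test_s = scaler.transform(X_test)
    model = LinearRegression().fit(X_train_s, y_train)
    predictions = model.predict(X_test_s)
    rmse_by_scaler[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))

for name, rmse in rmse_by_scaler.items():
    print(f"{name}: RMSE = {rmse:.4f}")
```

Note that for an unregularized linear regression, the three affine scalers (standard, min-max, robust) produce the same predictions; only a nonlinear transform such as `PowerTransformer` changes the fit. The choice of scaler matters more for regularized models like `Ridge` and `Lasso`.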
After fitting the models, the results are:
And after rescaling the tanh-scaled results, look at the RMSE of each model:
California housing results
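The rescaling step is not shown in the post. One way to do it, assuming the mean used to scale the target was kept, is to invert the formula: target = mean(target) * arctanh(TargetScaled). The sketch below uses hypothetical names (`tanh_unscale`, `eps`) and made-up numbers; it is not the code from the notebooks.

```python
import numpy as np

def tanh_scale(values, mean):
    """Forward transform: tanh(values / mean)."""
    return np.tanh(np.asarray(values, dtype=float) / mean)

def tanh_unscale(scaled, mean, eps=1e-12):
    """Inverse transform: mean * arctanh(scaled)."""
    # Clip so arctanh never receives exactly -1 or 1, where it is infinite.
    scaled = np.clip(np.asarray(scaled, dtype=float), -1 + eps, 1 - eps)
    return mean * np.arctanh(scaled)

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical target values and scaled-space predictions (illustration only).
y_true = np.array([120.0, 250.0, 310.0, 90.0])
y_mean = y_true.mean()
y_scaled = tanh_scale(y_true, y_mean)
pred_scaled = y_scaled + np.array([0.01, -0.02, 0.005, 0.0])

# Bring the predictions back to original units before computing the error.
pred = tanh_unscale(pred_scaled, y_mean)
print("RMSE in original units:", rmse(y_true, pred))
```

Computing the RMSE after the inverse transform keeps it in the same units as the RMSE of the models trained with the other scalers, so the numbers can be compared directly.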
Now let's compare with the León housing results.
After rescaling the tanh-scaled results, I get the following RMSE:
León housing results
Looking at the results, it can be seen that tanh scaling gives the best results.
Complete Notebooks