====== 07 - SML: Regression ======

Data:

<sxh python>
import numpy as np
import matplotlib.pyplot as plt

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
</sxh>

Distribution:

<sxh python>
fig, axes = plt.subplots(figsize=(9, 6))  # subplots() already returns the Axes
axes.axis([0, 2, 0, 14])                  # x range 0-2, y range 0-14
axes.scatter(X, y)
</sxh>

{{ :clase:ia:saa:4_sml_regresion:grafico1.png?400 |}}

Train the model: the **fit()** method of the **sklearn** library

<sxh python>
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)

print('Intercept:', model.intercept_)
print('Slope:', model.coef_)
</sxh>

<sxh base>
Intercept: [4.6263185]
Slope: [[2.46782729]]
</sxh>
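For intuition, the same kind of intercept and slope can be obtained with ordinary least squares directly in NumPy. This is only a sketch of the idea, not a description of how sklearn works internally. It generates its own data with a fixed seed (an assumption, since the data above is unseeded), so its numbers will differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for this sketch
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))

# Design matrix with a column of ones so the first weight acts as the intercept
A = np.hstack([np.ones_like(X), X])

# Least-squares solution of A @ w ≈ y
w, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = w[0, 0], w[1, 0]

print('Intercept:', intercept)
print('Slope:', slope)
```

Both values should land near the true parameters used to generate the data (4 and 3), with some deviation caused by the noise.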

Predictions: the **predict()** method

<sxh python>
y_predict = model.predict(X)
</sxh>

<sxh python>
fig, axes = plt.subplots(figsize=(9, 6))
axes.axis([0, 2, 0, 14])
axes.plot(X, y_predict, color="r")  # fitted regression line
axes.scatter(X, y)
</sxh>

{{ :clase:ia:saa:4_sml_regresion:grafico2.png?400 |}}

Predictions for new values:

<sxh python>
new_values = [[1.95], [1.23], [2.34]]

new_predicts = model.predict(new_values)

print(new_predicts)
</sxh>

<sxh base>
[[ 9.43858173]
 [ 7.66174608]
 [10.40103437]]
</sxh>
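What **predict()** does for a simple linear regression can be checked by hand: each prediction is the intercept plus the slope times the input. The sketch below copies the two coefficients from the output printed earlier; the name ''manual_predicts'' is only illustrative.

```python
import numpy as np

# Values copied from the printed model output above
intercept = 4.6263185
slope = 2.46782729

new_values = np.array([[1.95], [1.23], [2.34]])

# A fitted simple linear regression predicts intercept + slope * x
manual_predicts = intercept + slope * new_values
print(manual_predicts)
```

Up to rounding of the copied coefficients, the result should match the three predictions shown above.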

===== Splitting training and test data =====

<sxh python>
import pandas as pd
</sxh>

<sxh python>
from sklearn.datasets import load_iris

iris = load_iris()

data=pd.DataFrame({
    'sepal length':iris.data[:,0],
    'sepal width':iris.data[:,1],
    'petal length':iris.data[:,2],
    'petal width':iris.data[:,3],
    'species':iris.target
})
data.head()
</sxh>

<sxh base>
	sepal length	sepal width	petal length	petal width	species
0	         5.1	        3.5	         1.4	        0.2	      0
1	         4.9	        3.0	         1.4	        0.2	      0
2	         4.7	        3.2	         1.3	        0.2	      0
3	         4.6	        3.1	         1.5	        0.2	      0
4	         5.0	        3.6	         1.4	        0.2	      0
</sxh>

<sxh python>
data.info()
</sxh>

<sxh base>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal length  150 non-null    float64
 1   sepal width   150 non-null    float64
 2   petal length  150 non-null    float64
 3   petal width   150 non-null    float64
 4   species       150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
</sxh>

==== Separating the labels ====

<sxh python>
X = data[['sepal length', 'sepal width', 'petal length', 'petal width']]
y = data['species']
X.head()
</sxh>

<sxh base>
	sepal length	sepal width	petal length	petal width
0	         5.1	        3.5	         1.4	        0.2
1	         4.9	        3.0	         1.4	        0.2
2	         4.7	        3.2	         1.3	        0.2
3	         4.6	        3.1	         1.5	        0.2
4	         5.0	        3.6	         1.4	        0.2
</sxh>

<sxh python>
y.head()
</sxh>

<sxh base>
0    0
1    0
2    0
3    0
4    0
Name: species, dtype: int64
</sxh>

==== Splitting training/test data ====

<sxh python>
print(data['species'].unique())
X_train = data[0:100]
print(X_train['species'].unique())
</sxh>

<sxh base>
[0 1 2]
[0 1]
</sxh>

The Iris dataset holds 150 flower samples from 3 species. The problem is that the rows are sorted by species: the first 50 rows belong to the first species (label 0), the next 50 to the second (label 1) and the last 50 to the third (label 2). If we take the first 100 rows as training data, our model will only ever see the first 2 species.

<note important>When splitting the data, we have to make sure that the samples are representative.
</note>
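A minimal sketch of the problem and of the simplest remedy, shuffling the row indices before cutting. It uses a stand-in label array with the same layout as the Iris target instead of the real dataset, and the seed is an arbitrary choice made for reproducibility.

```python
import numpy as np

# Labels ordered by class, mimicking the layout of the Iris target
labels = np.repeat([0, 1, 2], 50)

# Taking the first 100 rows misses the last class entirely
print(np.unique(labels[:100]))

# Shuffling the row indices first gives a mixed sample
rng = np.random.default_rng(42)  # seed chosen only for reproducibility
idx = rng.permutation(len(labels))
train_idx, test_idx = idx[:100], idx[100:]
print(np.unique(labels[train_idx]))
```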

We can use the **train_test_split()** function from the sklearn library, which shuffles the rows before splitting:

<sxh python>
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(y_train.unique())
</sxh>

<sxh base>
[0 2 1]
</sxh>

<sxh python>
X_train.head()
</sxh>

<sxh base>

        sepal length	sepal width	petal length	petal width
28	         5.2	        3.4	         1.4	        0.2
6	         4.6	        3.4	         1.4	        0.3
59	         5.2	        2.7	         3.9	        1.4
57	         4.9	        2.4	         3.3	        1.0
135	         7.7	        3.0	         6.1	        2.3
</sxh>
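Because the split is random, the rows shown will change on every run, and the class proportions are only approximately preserved. A possible refinement, sketched here on stand-in arrays with the same class layout as Iris, is to pass ''stratify'' and ''random_state'' to **train_test_split()**. The names ''X_demo'' and ''y_demo'' are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same class layout as Iris (150 rows, 3 classes of 50)
X_demo = np.arange(150).reshape(-1, 1)
y_demo = np.repeat([0, 1, 2], 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,
    stratify=y_demo,  # keep class proportions equal in both subsets
    random_state=0,   # fixed seed so the split is repeatable
)

print(np.bincount(y_tr))  # rows per class in the training set
print(np.bincount(y_te))  # rows per class in the test set
```

With stratification, every class contributes the same share of rows to the training and test sets.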

===== Metrics =====

<sxh python>
import numpy as np
import matplotlib.pyplot as plt
</sxh>

<sxh python>
np.random.seed(99)

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
</sxh>

<sxh python>
fig, axes = plt.subplots(figsize=(9, 6))
axes.axis([0, 2, 0, 14])
axes.scatter(X, y)
</sxh>

{{ :clase:ia:saa:4_sml_regresion:grafico1.png?400 |}}

<sxh python>
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)

y_predict = model.predict(X)
</sxh>

<sxh python>
fig, axes = plt.subplots(figsize=(9, 6))
axes.axis([0, 2, 0, 14])
axes.plot(X, y_predict, color="r")  # fitted regression line
axes.scatter(X, y)
</sxh>

{{ :clase:ia:saa:4_sml_regresion:grafico2.png?400 |}}

==== Metrics with sklearn ====


<sxh python>
from sklearn import metrics

MAE = metrics.mean_absolute_error(y, y_predict)
MSE = metrics.mean_squared_error(y, y_predict)
RMSE = np.sqrt(MSE)  # avoids squared=False, which is deprecated in recent sklearn releases
R2 = metrics.r2_score(y, y_predict)

print("MAE:", MAE)
print("MSE:", MSE)
print("RMSE:", RMSE)
print("R2:", R2)
</sxh>

<sxh base>
MAE: 0.7393683940499426
MSE: 0.8449242930864277
RMSE: 0.9191976354878355
R2: 0.7143641795373913
</sxh>
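To see what these four metrics measure, they can be computed by hand with NumPy. The sketch below uses a tiny made-up example (''y_true'' and ''y_pred'' are invented values, not the data above) so the arithmetic is easy to follow.

```python
import numpy as np

# Small made-up example so the arithmetic is easy to follow
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 8.5])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                   # mean absolute error
mse = np.mean(errors ** 2)                      # mean squared error
rmse = np.sqrt(mse)                             # root mean squared error
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # coefficient of determination

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2:", r2)
```

MAE and RMSE are expressed in the same units as the target, while the coefficient of determination is unitless and equals 1 for a perfect fit.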

==== Residual plots ====

<sxh python>
# Residual plot 1: residuals against X

residuos = y - y_predict

fig, axes = plt.subplots(figsize=(15, 6))
axes.axis([0, 2, -4, 4])
axes.axhline(y=0, linestyle='--', color='red', lw=2)  # zero-error reference line
axes.scatter(X, residuos)
axes.set_title('Model residuals', fontsize=10, fontweight="bold")
axes.set_xlabel('X')
axes.set_ylabel('Residual')

plt.show()
</sxh>

{{ :clase:ia:saa:4_sml_regresion:grafico3.png?400 |}}
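A healthy residual plot shows points scattered around zero with no visible trend. For a least-squares line that includes an intercept, two of these properties hold by construction: the residuals average to zero and are uncorrelated with the input. The sketch below checks this on freshly generated data; it uses ''np.polyfit'' instead of sklearn, and the seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(99)  # seed chosen only for this sketch
x = 2 * rng.random(100)
y_obs = 4 + 3 * x + rng.standard_normal(100)

# Fit a straight line by least squares (np.polyfit returns the slope first)
slope, intercept = np.polyfit(x, y_obs, 1)
residuals = y_obs - (intercept + slope * x)

# With an intercept in the model, both values should be ~0 up to rounding
print("Mean residual:", residuals.mean())
print("Correlation with x:", np.corrcoef(x, residuals)[0, 1])
```

What the plot adds beyond these numbers is the shape: a curve or a funnel in the residuals suggests the linear model is missing structure in the data.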

<sxh python>
# Residual plot 2: predicted against actual values

fig, axes = plt.subplots(figsize=(15, 6))
axes.axis([y.min(), y.max(), y_predict.min(), y_predict.max()])
# Dashed line y = x: points lying on it are perfect predictions
axes.plot([y.min(), y.max()], [y.min(), y.max()], linestyle='--', color='red', lw=2)
axes.scatter(y, y_predict)
axes.set_title('Predicted vs actual', fontsize=10, fontweight="bold")
axes.set_xlabel('Actual')
axes.set_ylabel('Predicted')

plt.show()
</sxh>

{{ :clase:ia:saa:4_sml_regresion:grafico4.png?400 |}}

===== Polynomial regression =====

<sxh python>
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
</sxh>

We generate the data with y = -100 - 5x + 5x² + 0.1x³ + noise
 
<sxh python>
def true_fun(X):
    return -100 - 5 * X + 5 * np.power(X, 2) + 0.1 * np.power(X, 3)

np.random.seed(0)

n_samples = 100

X_np = np.random.uniform(-50,50, n_samples)
y_np = true_fun(X_np) + np.random.randn(n_samples) * 1000
</sxh>

We create a pandas dataframe so that the training/test split is easy to perform:

<sxh python>
df = pd.DataFrame({'X':X_np, 'y':y_np})
df.head()
</sxh>

<sxh base>
	X	        y
0	4.881350	-1158.787607
1	21.518937	4005.020826
2	10.276338	950.817651
3	4.488318	-1548.918554
4	-7.634520	1673.355793
</sxh>

Distribution:

<sxh python>
fig, axes = plt.subplots(figsize=(9, 6))
axes.scatter(df.X, df.y)
</sxh>

{{ :clase:ia:saa:4_sml_regresion:grafico5.png?400 |}}

We split the data:

<sxh python>
X = df[['X']]
y = df[['y']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
</sxh>

Distribution of the training data:

{{ :clase:ia:saa:4_sml_regresion:grafico15.png?400 |}}

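To implement polynomial regression we use the **PolynomialFeatures** transformer from the sklearn library. It expands the input into polynomial terms, and an ordinary linear regression is then fitted on those terms.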

<sxh python>
from sklearn.preprocessing import PolynomialFeatures

pf = PolynomialFeatures(degree=3)         # use polynomial terms up to degree 3
X_train_poly = pf.fit_transform(X_train)  # expand the input into polynomial features
# fit a linear regression on the (now polynomial) features
model2 = LinearRegression()
model2.fit(X_train_poly, y_train) 

print('w = ' + str(model2.coef_) + ', b = ' + str(model2.intercept_))
y_train_predict = model2.predict(X_train_poly)
</sxh>

<sxh base>
w = [[  0.         -13.77472612   4.82789218   0.1051025 ]], b = [330.46146111]
</sxh>

The model fits the regression curve:

;#;
y = 330.46 - 13.77x + 4.83x² + 0.105x³
;#;
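The leading 0 in the printed weights belongs to the constant column that **PolynomialFeatures** adds by default; the intercept already covers it. A sketch of the expansion done by hand follows, with the coefficients copied from the printed output and three arbitrary input values chosen for illustration.

```python
import numpy as np

# Coefficients copied from the printed output above
w = np.array([0.0, -13.77472612, 4.82789218, 0.1051025])
b = 330.46146111

x = np.array([-2.0, 0.0, 3.0])  # arbitrary inputs for illustration

# Degree-3 expansion of a single feature: columns 1, x, x^2, x^3
X_poly = np.column_stack([x ** 0, x, x ** 2, x ** 3])

# The "polynomial" model is a linear model on the expanded columns
y_hat = X_poly @ w + b
print(X_poly)
print(y_hat)
```

The model stays linear in its weights; only the inputs are transformed, which is why **LinearRegression** can still be used to fit it.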

<sxh python>
# Sort the points by X so the line in the plot is drawn from left to right
order = np.argsort(X_train.values.ravel())
new_x = X_train.values.ravel()[order]
new_y = np.asarray(y_train_predict).ravel()[order]

fig, axes = plt.subplots(figsize=(9, 6))
axes.plot(new_x, new_y, color="r")  # fitted polynomial curve
axes.scatter(X_train, y_train)
</sxh>

{{ :clase:ia:saa:4_sml_regresion:grafico17.png?400 |}} 
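The held-out test data can be used to check how the polynomial model generalizes. One way to do this, sketched below, is to chain the expansion and the regression with ''make_pipeline'' so the test data goes through the same transformation as the training data. The sketch regenerates similar data with its own seed and uses illustrative names (''poly_model'', ''X_all'', ''y_all''), so its numbers are not those of the run above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def true_fun(x):
    return -100 - 5 * x + 5 * x ** 2 + 0.1 * x ** 3


rng = np.random.default_rng(0)  # seed chosen only for this sketch
X_all = rng.uniform(-50, 50, size=(100, 1))
y_all = true_fun(X_all[:, 0]) + rng.standard_normal(100) * 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0)

# The pipeline fits the expansion and the regression on the training data,
# then applies the already-fitted expansion when predicting on test data
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(X_tr, y_tr)

mse_train = mean_squared_error(y_tr, poly_model.predict(X_tr))
mse_test = mean_squared_error(y_te, poly_model.predict(X_te))
print("Train MSE:", mse_train)
print("Test MSE:", mse_test)
```

A test MSE far above the training MSE would point to overfitting; similar values suggest the chosen degree generalizes reasonably.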

===== Exercises =====

**Exercise 1**

{{ :clase:ia:saa:4_sml_regresion:heights.zip |}}

Using the dataset above (heights of parents and children):

  * Create a Pandas dataframe with the information from the dataset

  * Show the dataset information and a plot relating the two variables

  * Split the dataframe into 4 groups: training feature (parents' height), test feature (30% of the rows), training label (children's height), test label

  * Show the number of records in each group (training and test)

  * Train a linear regression model with sklearn and show the coefficients (intercept and slope)

  * Show a plot with the points and the regression line

  * Show the metrics for the training data

  * Compute the predictions for the test data and show the values of the different metrics

**Exercise 2**

{{ :clase:ia:saa:4_sml_regresion:ejercicio41.zip |}}

Using the dataset above:

  * Show a plot with the distribution of the data

  * Split the data into training/test sets

  * Show a plot of the training data to make sure it follows the same distribution as the full dataset

  * Try to fit the regression curve to the data. Try polynomial degrees 1, 4, 10, 20, 30 and 50. Show the MSE for each one, together with a plot showing the distribution of the data and the fitted regression curve