a few failures where the model is unable to handle sample stuttering. These failure samples were used in training the next cohort of the model. Training was repeated for six successive rounds on each of the training data blocks. This number of rounds was chosen because it yields a stable learning result without making training time-consuming.

Training stopped when the accumulated error of the last training step fell below 0.2% of the total error. A summary of the training is shown in Table $tab:inerv2$.
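The stopping rule can be sketched as follows. This is a minimal illustration, not the implementation used here: the function name `should_stop` and the reading of "total error" as the sum of the accumulated errors over all training steps so far are assumptions.

```python
from typing import Sequence


def should_stop(step_errors: Sequence[float], rel_threshold: float = 0.002) -> bool:
    """Return True when the last step's error is below 0.2% of the total.

    Assumption: "total error" is the sum of the accumulated errors of all
    training steps recorded so far (hypothetical interpretation).
    """
    if not step_errors:
        # Nothing has been trained yet, so there is nothing to compare.
        return False
    total = sum(step_errors)
    if total <= 0:
        # No error at all: further training cannot improve anything.
        return True
    return step_errors[-1] < rel_threshold * total


if __name__ == "__main__":
    history = [10.0, 5.0, 0.02]
    print(should_stop(history))
```

With this interpretation the threshold scales with the overall error level, so the same 0.2% setting applies regardless of dataset size.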

The model was trained for each gender on all three datasets and evaluated on all three. The evaluation calculates the mean absolute (unsigned) difference (MAD) [@fawcett2006introduction] between ground truth and model output over all trials. The mean squared error (MSE) is the average over all trials of the squared difference between ground truth and model output. Here $\textbf{y}_k$ is the ground truth and $\hat{\textbf{y}}_k$ the model output for trial $k$, and $n$ is the number of trials.

$$\text{MSE} = \frac{1}{n}\sum_{k=1}^n(\textbf{y}_k -\hat{\textbf{y}}_k)^2$$

$$\text{MAD} = \frac{1}{n}\sum_{k=1}^n|\textbf{y}_k -\hat{\textbf{y}}_k|$$
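As a rough illustration of the two metrics, the sketch below computes the mean squared error and the mean absolute difference for scalar per-trial values. The function names `mse` and `mad` and the use of plain Python sequences are assumptions made for the example, and MAD is taken here as the plain mean of absolute differences.

```python
from typing import Sequence


def _check(y_true: Sequence[float], y_pred: Sequence[float]) -> int:
    # Both sequences must describe the same, non-empty set of trials.
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one trial is required")
    return len(y_true)


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error: average of squared per-trial differences."""
    n = _check(y_true, y_pred)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


def mad(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute difference: average of unsigned per-trial differences."""
    n = _check(y_true, y_pred)
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / n


if __name__ == "__main__":
    truth = [1.0, 2.0, 3.0]
    output = [1.0, 2.0, 5.0]
    print(mse(truth, output), mad(truth, output))
```

Because MSE squares each difference, a single large error weighs more heavily in it than in MAD, which is one reason to report both.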

Results
-------

All scenarios were trained and evaluated on both a clean and a noisy speech dataset. This classification was carried out for all three datasets, and every model was tested on all three. The results are shown in Table $tab:results$. The best result for each model is shown in bold, and the ranking of the models is given below.
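The train-on-one, test-on-all protocol can be sketched as a small grid evaluation. This is only an illustration under stated assumptions: `train` and `evaluate` are hypothetical callables standing in for the actual training and scoring procedures, and a higher score is assumed to be better unless `higher_is_better` is set to `False`.

```python
from typing import Callable, Dict, Tuple, TypeVar

Data = TypeVar("Data")
Model = TypeVar("Model")

Scores = Dict[Tuple[str, str], float]


def cross_dataset_scores(
    datasets: Dict[str, Data],
    train: Callable[[Data], Model],
    evaluate: Callable[[Model, Data], float],
) -> Scores:
    """Train one model per dataset and score it on every dataset."""
    scores: Scores = {}
    for train_name, train_data in datasets.items():
        model = train(train_data)  # hypothetical training routine
        for test_name, test_data in datasets.items():
            scores[(train_name, test_name)] = evaluate(model, test_data)
    return scores


def best_per_model(
    scores: Scores, higher_is_better: bool = True
) -> Dict[str, Tuple[str, float]]:
    """For each trained model, pick the test set with its best score."""
    best: Dict[str, Tuple[str, float]] = {}
    for (train_name, test_name), score in scores.items():
        current = best.get(train_name)
        improved = current is None or (
            score > current[1] if higher_is_better else score < current[1]
        )
        if improved:
            best[train_name] = (test_name, score)
    return best
```

The entries returned by `best_per_model` correspond to the values that would be set in bold in a results table, one per trained model.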

All the speech enhancement models perform better on clean speech classification than on noisy speech classification. The best performance was obtained by the model trained on the subject data and tested on the other two datasets.

A model trained on a given dataset achieves its highest accuracy when tested on that same dataset. The highest accuracies are obtained for the models trained on the subject data and tested on the other two datasets. The