In Japan, special education classes are provided in elementary schools to support children whose language development lags behind that of their peers due to hearing, speech, or developmental disabilities. We have developed software that supports children's repeated pronunciation practice in such classes [Masuda-Katsuse, INTERSPEECH 2014]. The software evaluates each speech sound and presents scores for the correct phoneme and for probable error phonemes, but its precision was insufficient. Recently, the phonetic features represented in the layers of deep neural networks (DNNs) have been investigated: Sim [ASRU 2015] displayed hidden-layer activities in a two-dimensional representation in which the activities for different phonemes could be compared, and Nagamine et al. [INTERSPEECH 2015] showed that phonetic features organize the activations in different layers. We therefore aim to use visualizations of DNN layer activities as visual feedback on pronunciation evaluations. As a pilot study, we investigated how well Japanese phonemes can be discriminated on two-dimensional manifolds onto which DNN layer activities are mapped.

We used the Corpus of Spontaneous Japanese for this study. The feature vectors and their attribute instances were extracted using the Kaldi toolkit. Thirteen-dimensional MFCC features were spliced in time with a context of 9 frames (±4 frames), reduced to 40 dimensions with LDA, de-correlated with MLLT, and speaker-normalized with fMLLR. These 40-dimensional feature vectors were regarded as the spectral features (SpF) of each frame. Forced alignment with the Kaldi toolkit labeled the feature vector of each frame with a pdf-id (the index of a pdf in the transition model); since the 497 pdf-ids map to 25 phonemes, a phoneme was allocated to each frame as its attribute instance. In addition, periodicities were obtained by multiplying the aperiodicity calculated by WORLD [Morise, Speech Communication, 84:57-65, 2016] by -1 and taking the logarithm. They were spliced in time with the same context of 9 frames (±4 frames) and regarded as the source features (SoF) of each frame; concatenating SpF and SoF yielded 49-dimensional feature vectors.

The neural network had a 40-dimensional (SpF) or 49-dimensional (SpF+SoF) input layer, a hidden layer with 256 nodes, four further hidden layers with 2,048 nodes each, and an output layer with 25 nodes. ReLU activation functions were used for all units except those in the output layer, which used the softmax activation function. The phonemic classification accuracy of the network was 58.8% when the input vectors were SpF and 58.3% when they were SpF+SoF.

Next, the activities at the stage prior to the activation function of the output layer were extracted and mapped onto a two-dimensional manifold using parametric t-SNE, in which the transformation model is itself computed by a DNN. For training this DNN, we used 2,000 activities per phoneme as training and validation data. Parametric t-SNE makes it possible to map test data onto the manifold learned from the training data, so we can visually grasp the distance between a pronunciation to be evaluated and the correct phoneme or probable error phonemes. Since the neighborhood of each point on the manifold is a local Euclidean space, the phonemes of the test data were predicted by large-margin 5-nearest-neighbor classification, and the accuracies were obtained.
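To make the splicing step concrete, the following sketch (in Python with NumPy and scikit-learn, which are assumptions here; the actual pipeline used Kaldi, and the MLLT and fMLLR steps are omitted) stacks each 13-dimensional MFCC frame with its ±4 neighbors and reduces the 117-dimensional result to 40 dimensions with LDA. The inputs `mfcc` and `pdf_ids` are hypothetical placeholders for the corpus features and forced-alignment labels.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def splice_frames(feats: np.ndarray, context: int = 4) -> np.ndarray:
    """Stack each frame with its +/-context neighbours (edge frames are
    repeated), turning a (T, D) matrix into a (T, (2*context+1)*D) matrix."""
    T, _ = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

def extract_spf(mfcc: np.ndarray, pdf_ids: np.ndarray) -> np.ndarray:
    """mfcc: (T, 13) MFCC matrix; pdf_ids: (T,) frame labels -- hypothetical
    inputs. Splicing with +/-4 frames gives 117 dimensions; LDA trained on
    the pdf-id labels reduces them to the 40-dimensional SpF."""
    spliced = splice_frames(mfcc, context=4)           # (T, 117)
    lda = LinearDiscriminantAnalysis(n_components=40)
    return lda.fit_transform(spliced, pdf_ids)         # (T, 40)
```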
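The source features can be sketched similarly. The pyworld binding of WORLD, the use of a frequency-averaged scalar aperiodicity per frame, and the reading of "a logarithm after multiplying by -1" as -log(aperiodicity) are all assumptions of this sketch; `splice_frames` is reused from the sketch above. Splicing the scalar periodicity over ±4 frames yields the 9-dimensional SoF, and 40 + 9 gives the 49-dimensional concatenated vectors.

```python
import numpy as np
import pyworld as pw  # Python binding of WORLD (an assumption of this sketch)

def extract_sof(waveform: np.ndarray, fs: int, context: int = 4) -> np.ndarray:
    """Compute a per-frame periodicity from the WORLD aperiodicity and splice
    it over +/-context frames, yielding the 9-dimensional SoF per frame."""
    f0, t = pw.harvest(waveform, fs)    # F0 contour and frame times
    ap = pw.d4c(waveform, f0, t, fs)    # aperiodicity spectrum, shape (T, K)
    a = ap.mean(axis=1)                 # scalar aperiodicity per frame (assumption)
    periodicity = -np.log(np.clip(a, 1e-10, None))  # one reading of the abstract
    return splice_frames(periodicity[:, None], context)  # (T, 9)
```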
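The classifier itself is small enough to sketch directly. PyTorch is an assumption here (the abstract does not name a framework), and training details such as the optimizer are not specified. The pre-softmax output of the final linear layer corresponds to the "activities" that are later mapped by parametric t-SNE.

```python
import torch
import torch.nn as nn

class PhonemeClassifier(nn.Module):
    """Input (40 for SpF, 49 for SpF+SoF) -> 256 -> 4 x 2048 -> 25 phonemes."""

    def __init__(self, input_dim: int = 40, num_phonemes: int = 25):
        super().__init__()
        dims = [input_dim, 256, 2048, 2048, 2048, 2048]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.hidden = nn.Sequential(*layers)
        self.out = nn.Linear(dims[-1], num_phonemes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-softmax activities: these are what parametric t-SNE later maps.
        # Softmax itself is applied by the loss (e.g., nn.CrossEntropyLoss).
        return self.out(self.hidden(x))
```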
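Finally, once training and test activities have been mapped to two dimensions by the parametric t-SNE network, the prediction step can be sketched as follows. A plain Euclidean 5-nearest-neighbor classifier from scikit-learn stands in here for the large-margin 5-NN classifier (the metric-learning step is omitted), and the embedding and label arrays are hypothetical placeholders.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_phoneme_accuracy(emb_train: np.ndarray, y_train: np.ndarray,
                         emb_test: np.ndarray, y_test: np.ndarray,
                         k: int = 5) -> float:
    """Predict test phonemes from their 2-D embeddings with k-NN and return
    the accuracy. Plain Euclidean k-NN is a stand-in for large-margin 5-NN,
    relying on the neighborhood of each point being locally Euclidean."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(emb_train, y_train)
    return float(knn.score(emb_test, y_test))
```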
For testing, 300 activities per phoneme were used. When the feature vectors were SpF, the consonant accuracy was 79.1% and the vowel accuracy was 80.2%; when they were SpF+SoF, the consonant accuracy was 77.5% and the vowel accuracy was 80.5%. These results suggest that adding the periodicity features had no effect. In our pronunciation-practice software, the ability to detect articulation errors is critical: with SpF feature vectors, the accuracy for the place of articulation was 83.0% and that for the manner of articulation was 84.1%.