Several vectorization approaches are currently used to transform text documents into a numerical format. A single document can yield a very large number of features, and the resulting high-dimensional vectors take a long time to process. To reduce the number of dimensions, this work uses an improved Naive Bayes algorithm to vectorize each document as a probability distribution over the categories to which the document may belong. The improved Naive Bayes vectorization uses Laplace smoothing to ensure that posterior probabilities are never zero, and a logarithmic function to handle probability products that would otherwise be too small to represent. Text classification algorithms based on the vector space model, such as the Support Vector Machine (SVM), then use this probability distribution as the vector that represents each document for classification. To validate the improved Naive Bayes vectorization technique, its results are compared with those of TF-IDF vectorization. The results show that transforming the data with the improved Naive Bayes vectorization technique reduces dimensionality and improves the performance of the SVM classification approach.
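The vectorization idea described in the abstract can be illustrated with a minimal sketch. It assumes a standard multinomial Naive Bayes model with add-one (Laplace) smoothing and log-space scoring; the abstract does not give the paper's exact formulation, so this is not the authors' implementation. The function names `train_nb` and `nb_vectorize`, the toy corpus, and the smoothing parameter `alpha` are all hypothetical. Each document becomes a vector with one entry per category, rather than one entry per vocabulary term as in TF-IDF.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model (illustrative, not the paper's code).

    Returns the sorted class list, log priors, and Laplace-smoothed
    log likelihoods for every vocabulary word in every class.
    """
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d.split()})
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(doc_counts[c] / len(docs)) for c in classes}
    log_like = {}
    for c in classes:
        # Laplace smoothing: add alpha to every count so no probability is zero.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return classes, log_prior, log_like

def nb_vectorize(doc, model):
    """Map a document to a probability vector with one entry per class."""
    classes, log_prior, log_like = model
    scores = []
    for c in classes:
        # Sum logs instead of multiplying probabilities to avoid underflow.
        score = log_prior[c]
        for w in doc.split():
            if w in log_like[c]:  # out-of-vocabulary words are ignored
                score += log_like[c][w]
        scores.append(score)
    # Normalize in a numerically stable way (log-sum-exp trick).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

# Toy corpus, invented for illustration only.
docs = [
    "goal match team win",
    "team score goal",
    "election vote party",
    "party policy vote law",
]
labels = ["sports", "sports", "politics", "politics"]

model = train_nb(docs, labels)
vec = nb_vectorize("team goal vote", model)
print(model[0], vec)
```

In this toy case the vocabulary has ten terms, while the resulting vector has only two entries (one per category), which is the dimensionality reduction the abstract refers to. Vectors like `vec` could then be passed to a vector-space classifier such as an SVM; that step is omitted here.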
Hajah T. Sueno studied this question.