What question did this study set out to answer?

To evaluate the effectiveness of various text-based similarity measures in automating industrial classifications against human-coded benchmarks.

April 23, 2026

Automating enterprise industry classifications for official statistics: Leveraging text-based similarity measures

Key Points

Assessed token-overlap, TF-IDF cosine, edit-distance, and SBERT embeddings for classification.
Utilized a dataset of 6588 firms to compare accuracy and performance.
Employed statistical testing to validate differences in method effectiveness.
SBERT achieved an accuracy of 0.78, outperforming other methods.
Surface-level methods showed lower accuracy (Fuzzy: 0.43; Cosine: 0.31; Jaccard: 0.26).
Cochran’s Q test (Q = 8320.81) confirmed significant performance differences, with a p-value < 0.001.
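
To make the three surface measures named above concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not the study's implementation: the whitespace tokenizer, the smoothed IDF formula, the normalisation of edit distance by the longer string, the `LABELS` descriptions (simplified wordings of ISIC sections C, G and M), and all function names are hypothetical choices. SBERT is omitted because it needs a pretrained embedding model.

```python
import math
from collections import Counter

# Hypothetical, simplified label texts for three ISIC sections (illustration only).
LABELS = {
    "C": "manufacturing",
    "G": "wholesale and retail trade repair of motor vehicles",
    "M": "professional scientific and technical activities",
}

def tokens(text):
    """Lowercase whitespace tokenizer (an assumption; real pipelines clean more)."""
    return text.lower().split()

def jaccard(a, b):
    """Token overlap: |A ∩ B| / |A ∪ B| on the two token sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) with a smoothed IDF, one per document."""
    tokenized = [tokens(d) for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(description, labels, scorer):
    """Assign the label whose text scores highest under a pairwise scorer."""
    return max(labels, key=lambda code: scorer(description, labels[code]))

def classify_tfidf(description, labels):
    """Same idea for TF-IDF, which needs vectors fitted on all texts together."""
    codes = list(labels)
    vecs = tfidf_vectors([description] + [labels[c] for c in codes])
    return max(codes, key=lambda c: cosine(vecs[0], vecs[1 + codes.index(c)]))
```

Calling `classify("retail sale of clothing and footwear", LABELS, jaccard)` selects the label with the largest token overlap. Swapping in `edit_similarity` or `classify_tfidf` changes only the scoring rule, which is the comparison the key points describe.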

Abstract

Accurate industrial classification of firms forms the backbone of business surveys, economic policymaking, and international trade analysis. However, national statistics institutes (NSIs) worldwide grapple with the labor-intensive manual assignment of International Standard Industrial Classification (ISIC) codes, a process that is prone to human error, inconsistent across regions, and particularly burdensome for developing economies. This study confronts these challenges by assessing the performance of token-overlap (Jaccard), TF-IDF cosine similarity, edit-distance (fuzzy), and SBERT embeddings against human-coded ground truth in classifying firms. On a dataset of 6588 firms, performance diverges sharply: SBERT attains accuracy = 0.78 and weighted F1 = 0.78 (Cohen’s κ ≈ 0.75), while surface methods lag (Fuzzy: accuracy 0.43; Cosine: 0.31; Jaccard: 0.26). Statistical tests confirm these differences (Cochran’s Q = 8320.81, p < 0.001), and inter-method agreement is only fair (Fleiss’ κ ≈ 0.270), motivating a class-level diagnostic approach. Using confusion matrices and Haberman adjusted residuals, we expose systematic off-diagonal confusions (notably between manufacturing, professional/service, and certain retail/wholesale categories) and identify classes with strong, automatable diagonals versus sparse or ambiguous tails that require human coding.
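
The two diagnostics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration using the textbook formulas, not the authors' code. The function names and input layouts are assumptions: `cochrans_q` takes a hypothetical 0/1 matrix with one row per firm and one column per method (1 = method matched the human code), and `adjusted_residuals` takes a confusion matrix of counts.

```python
import math

def cochrans_q(correct):
    """Cochran's Q for k related binary outcomes.

    correct: list of rows, one per firm; each row holds one 0/1 entry per method.
    Returns (Q, degrees of freedom); under the null hypothesis of equal
    accuracy, Q is approximately chi-square distributed with k - 1 df.
    """
    k = len(correct[0])
    col_totals = [sum(row[j] for row in correct) for j in range(k)]
    row_totals = [sum(row) for row in correct]
    n_success = sum(row_totals)
    denom = k * n_success - sum(r * r for r in row_totals)
    if denom == 0:
        raise ValueError("Q is undefined when all methods agree on every firm")
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n_success ** 2) / denom
    return q, k - 1

def adjusted_residuals(table):
    """Haberman adjusted residuals for a contingency table of counts.

    Each cell is (observed - expected) / sqrt(expected * (1 - row share)
    * (1 - column share)); values far from 0 flag cells that occur more
    or less often than independence of rows and columns would predict.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            var = expected * (1 - row_tot[i] / n) * (1 - col_tot[j] / n)
            # Cells in empty rows/columns have zero variance; report 0.0 there.
            out_row.append((observed - expected) / math.sqrt(var) if var > 0 else 0.0)
        result.append(out_row)
    return result
```

Applied to a confusion matrix of human codes (rows) against predicted codes (columns), large positive residuals on the diagonal mark classes a method codes reliably, while large positive off-diagonal residuals mark systematic confusions of the kind the abstract reports.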
