Context- and content-aware node vectorization represents graph nodes as low-dimensional vectors by taking into account the graph's structure, the content of each node (often textual data), and the context surrounding the node. Graphs whose node content is text are known as textual graphs, and they appear in many domains, such as social networks, recommendation systems, and academic citation networks. In citation graphs, which are the focus of this study, each node represents a research article, the node content is the article's text, and edges indicate citation or reference relationships between papers. This study investigates how keyword and keyphrase extraction methods can be used to simplify node content while improving the performance of node embedding methods. Several text extraction methods are applied to a large citation graph constructed from ArXiv papers, and their output is evaluated with two node embedding methods, CANE and DeepEmLAN. Replacing full-text inputs with concise, descriptive keyphrases yields faster processing while frequently maintaining or even improving performance in link prediction and node classification tasks. The study also investigates a text enrichment strategy that leverages known node category information. Additionally, a graph augmentation approach is examined to better simulate real-world conditions. The results show that, for both node embedding methods, this preprocessing technique amplifies the gap in computation time between full-text inputs and the extracted keywords and keyphrases.
Γεώργιος Π. Ματλής studied this question.