Optimisation as a tool for natural language processing

ANZIAM 2022: Contributed Talk

Abstract:

Every day more textual data is created, for example through online product review and social media websites. In recent years, this has also included text produced by libraries and archives as they digitise and transcribe their collections of historic documents. This process not only gives researchers easier access to historic documents, but also allows large collections of them to be analysed using natural language processing techniques. However, these techniques have primarily been developed and tested on modern data, so they may not transfer accurately to historic data. It is therefore necessary to study both the techniques and the underlying structure of written language to ensure that they can be applied accurately to historic text.

Our research focuses on investigating and refining two natural language processing techniques, sentiment analysis and date extraction, so that they are accurate on historic text. For sentiment analysis, this involves investigating factors that contribute to a person’s sentiment, such as their gender or life experiences. It also requires us to investigate whether the assumptions made in developing current techniques match the structure of written historic texts. For example, it has been shown that sentiment techniques which rely on a dictionary are often inaccurate when applied to texts from different contexts and time periods [1, 2]. An accurate date extraction technique is also needed for analysing historic documents, as many historic diaries have been collected by libraries. Such a technique must not only extract dates but also clean them, as dates are not always written in a clear and consistent format. This in turn requires an understanding of the various ways in which dates are written, and of the reasons why they may not be written consistently.

In this talk we will give an overview of these techniques and explain why it is necessary to have a well-formed mathematical model that can be applied in various circumstances. We will then give an example of how this can be done by discussing our current progress on a date extraction technique which uses optimisation to clean dates.
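To make the idea of cleaning dates by optimisation concrete, here is a minimal sketch in Python. It is not the method presented in the talk. It is an illustrative assumption in which each diary entry carries an ambiguous numeric date such as "4/3", every valid day/month reading is a candidate, and a dynamic program picks one reading per entry so that the sequence moves forward in time with the smallest total gap. The function names, the penalty of 1000 for going backwards, and the example year are all hypothetical.

```python
from datetime import date


def candidates(a: int, b: int, year: int) -> list[date]:
    """Return every valid reading of an ambiguous 'a/b' date in the given year."""
    readings: list[date] = []
    for day, month in ((a, b), (b, a)):
        try:
            reading = date(year, month, day)
        except ValueError:
            continue  # e.g. month 15 does not exist, so this reading is dropped
        if reading not in readings:
            readings.append(reading)
    return readings


def step_cost(prev: date, cur: date) -> int:
    """Assumed cost: the gap in days, plus a large penalty for going backwards."""
    gap = (cur - prev).days
    return gap if gap >= 0 else 1000 + abs(gap)


def clean_dates(options: list[list[date]]) -> list[date]:
    """Choose one candidate per entry, minimising the summed step cost.

    Assumes every entry has at least one candidate reading.
    """
    if not options:
        return []
    best = [0] * len(options[0])  # cheapest cost of a path ending at each candidate
    back: list[list[int]] = []    # back-pointers for recovering the chosen path
    for i in range(1, len(options)):
        new_best, pointers = [], []
        for cur in options[i]:
            costs = [best[j] + step_cost(prev, cur)
                     for j, prev in enumerate(options[i - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_min])
            pointers.append(j_min)
        best = new_best
        back.append(pointers)
    # Walk the back-pointers from the cheapest final candidate.
    k = min(range(len(best)), key=best.__getitem__)
    chosen = [k]
    for pointers in reversed(back):
        k = pointers[k]
        chosen.append(k)
    chosen.reverse()
    return [options[i][k] for i, k in enumerate(chosen)]


# Hypothetical diary entries written as "1/3", "2/3", "4/3", "15/3" in 1916.
entries = [(1, 3), (2, 3), (4, 3), (15, 3)]
cleaned = clean_dates([candidates(a, b, 1916) for a, b in entries])
print([d.isoformat() for d in cleaned])
```

In this sketch the last entry can only be read as 15 March, and that single unambiguous date pulls the earlier, ambiguous entries towards their day/month readings. A realistic model would need a richer candidate set (missing years, written-out months, transcription errors) and a cost function justified by the data.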

[1] W. L. Hamilton, K. Clark, J. Leskovec, and D. Jurafsky, “Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora,” arXiv:1606.02820 [cs], Sep. 2016.

[2] J. Lukes and A. Søgaard, “Sentiment analysis under temporal shift,” in Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 65–71.