Language identification method comparison: LivePublication example

Augustus Ellerm, Ben Adams, Mark Gahegan

2023

This article presents an early look at how the LivePublication framework can enable live, updating components and generative content for computationally driven sciences. We present a language identification performance task comparing the accuracy of two methods, langdetect and fastText.

Welcome to our LivePublication demonstration! This website represents how we can incorporate real-time computation within the narrative structure of a research article.

This LivePublication is the product of an ongoing initiative to move beyond the limitations of traditional scientific publishing by integrating dynamic computational workflows within the fabric of a publication. The LivePublication framework aims to enable live, reactive publications while simultaneously enhancing transparency, repeatability, and collaborative scientific research.

To enable this, LivePublication integrates with distributed Globus flows by providing custom Action Providers (LivePublication Action Providers, LPAPs) to generate descriptive RO-Crates at each major computational step in a workflow/methodology. These distributed artefacts are then combined into a full description of the workflow execution, recording inputs, outputs, methods, and descriptive metadata. This primary artefact is then used as a data model for a publication, such as this website, to draw upon - providing updating figures, metrics, and other metadata. The structure, the recorded metadata, and the methods of integration with the publication are all early areas under active development.
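As a sketch of how a publication layer might draw on such a crate, the following minimal Python example parses a hypothetical fragment of an ro-crate-metadata.json file and extracts an accuracy metric. The entity identifiers and metric property names here are illustrative assumptions, not the actual LivePublication schema.

```python
import json

# A minimal, hypothetical fragment of an RO-Crate metadata file
# (ro-crate-metadata.json). The metric property names on the
# statistics entity are illustrative assumptions.
crate_json = """
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "statistics.json"}]},
    {"@id": "statistics.json", "@type": "File",
     "name": "LID accuracy statistics",
     "fastText_accuracy": 0.986, "langdetect_accuracy": 0.9791}
  ]
}
"""

def entity(graph, entity_id):
    """Look up an entity in the flattened @graph by its @id."""
    return next(e for e in graph if e["@id"] == entity_id)

graph = json.loads(crate_json)["@graph"]
stats = entity(graph, "statistics.json")
print(f"fastText: {stats['fastText_accuracy']:.1%}, "
      f"langdetect: {stats['langdetect_accuracy']:.1%}")
```

A publication page could re-run this extraction on every workflow execution, so figures and metrics stay in sync with the latest crate.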

Find a browseable version of the underpinning primary RO-Crate here!

This primary RO-Crate was generated through the execution of a Globus flow, described below in Figure 1:

Fig. 1: Globus flows integrating LPAPs for generation of artefacts

LID performance comparison (⚙️ Generative Content ⚙️)

While future LivePublication applications will primarily focus on how author-driven content can be realistically and seamlessly integrated with live, updating articles, this article relies mostly on generative content drawing on data exported from the underlying RO-Crate. Research on how authored and live content can be integrated is ongoing. Below, GPT-4 provides a short description of the results of this computational workflow, drawing on data generated during the flow.

  • Model: GPT-4
  • Prompt: This experiment compares the performance of two language identification methods: fastText and Langdetect. FastText's results are results and langdetect results are results. Write a few paragraphs discussing the performance of each, comparing their best and worst language accuracies.

The evaluation of the language identification methods - fastText and Langdetect - reveals a nuanced performance profile contingent on the specific language being identified. Overall, the fastText model demonstrated superior performance with an overall accuracy of 98.6% compared to Langdetect's 97.91%. This comparison, however, does not capture the individual variances in accuracy across languages for the two models.

Delving into these language-specific performances, FastText exhibits impeccable accuracy in identifying several languages. These include German (deu), Greek (ell), English (eng), French (fra), Japanese (jpn), Thai (tha), and Chinese (zho) - all at 100% accuracy. Other languages such as Bulgarian (bul), Italian (ita), Russian (rus), and Vietnamese (vie) also show remarkable results with accuracy close to 100%. FastText's weakest performance is observed for Swahili (swa) at 85.4% accuracy, indicating a potential area for model improvement.

On the other hand, Langdetect also showcased impressive accuracy with several languages reaching 100% identification rate, namely Greek (ell), Japanese (jpn), Thai (tha), and Vietnamese (vie). It performed notably well with Arabic (ara), German (deu), and Turkish (tur) too, with accuracy rates nearing 100%. The lowest performance was observed with Dutch (nld) at 93.6%, signifying a potential area of focus for future model enhancements.

When comparing the two models on specific languages, FastText notably outperforms Langdetect in identifying languages such as Bulgarian, English, French, Dutch, Polish, Portuguese, and Spanish. Conversely, both models demonstrate equivalent performance in Arabic, Greek, Japanese, Thai, and Vietnamese identification. Langdetect's performance appears to surpass FastText slightly in Hindi and Urdu.

Fig. 2: fastText and langdetect method accuracy over test data, by language ID

Workflow Description (⚙️ Generative Content ⚙️)

Below is a very early attempt at generative content based on the generated Workflow Execution Plan (WEP). The WEP only provides a description of the Globus flow and includes no information regarding the actual execution of the method. Including details from the Workflow Execution Description (WED) could further enrich the description with execution state (succeeded, failed), time taken per step, and other pertinent information.

  • Model: GPT-4
  • Prompt: Generate a description of this workflow, from the perspective of an academic methodological section: WEP

The methodology to compare the performance of two language identification models, fastText and langdetect, involves the execution of a series of actions within a structured workflow. The actions range from data transfer operations to the execution of language identification models and statistical analysis of the generated results. The overall flow is organized as follows:

  1. Data Transfer for fastText Model: The process begins by transferring the required dataset from the data store to the location accessible by the fastText model. This is done using Globus, a high-speed data transfer protocol. The specific parameters of this operation such as the source and destination endpoint identifiers, along with the respective paths are supplied at runtime.

  2. fastText Model Execution: With the data in place, the fastText language identification model is executed. The model reads the input data from the path specified in the previous step, performs language identification, and stores the result in the same location.

  3. Result Transfer for fastText Model: The output of the fastText model, including statistics related to its performance, is then transferred back to the data store via another Globus transfer operation.

  4. Data Transfer for langdetect Model: Similarly to step 1, the required dataset is transferred from the data store to the location accessible by the langdetect model using a Globus transfer operation.

  5. langdetect Model Execution: The langdetect model is then executed, processing the transferred data to perform language identification. The results of the model are stored in the designated location.

  6. Result Transfer for langdetect Model: The results of the langdetect model, including its performance statistics, are then transferred back to the data store via Globus transfer operation.

  7. Statistical Analysis: With the results from both models in place, a statistical analysis is performed to compare their performances. The analysis includes accuracy statistics, and the generation of figures, tables, and other representations of the results for presentation in the publication layer.

  8. Statistics Transfer to the Data Store: Finally, the results of the statistical analysis are transferred back to the data store using another Globus transfer operation.
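Step 7's per-language accuracy computation might look like the following minimal sketch. The function shape and the sample labels are illustrative assumptions, not the actual analysis script used in the flow.

```python
from collections import defaultdict

def per_language_accuracy(gold, predicted):
    """Compute overall and per-language accuracy from parallel lists
    of gold ISO 639 language codes and predicted codes."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    by_lang = {lang: correct[lang] / total[lang] for lang in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, by_lang

# Tiny illustrative sample, not the real test data.
gold = ["eng", "eng", "deu", "swa", "swa", "swa", "nld"]
pred = ["eng", "eng", "deu", "swa", "eng", "swa", "eng"]
overall, by_lang = per_language_accuracy(gold, pred)
```

Running the same function over both models' prediction files yields the per-language comparison shown in Figure 2.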

Each of these operations is executed asynchronously, with designated wait times to ensure the completion of each task before moving to the next. The process has been designed for scalability and efficiency, with a focus on managing data and computational resources effectively.
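For readers unfamiliar with Globus Flows, the transfer-compute-transfer pattern described above can be sketched as an abridged flow definition in the Amazon-States-Language-derived format that Globus Flows uses. The state names, LPAP ActionUrls, and parameter placeholders below are illustrative assumptions, not the actual WEP.

```python
# A hypothetical, abridged flow definition covering one model's
# transfer/compute pair plus the statistics step. Real flows would
# include the langdetect states and result transfers as well.
flow_definition = {
    "StartAt": "DS_fastText_Transfer",
    "States": {
        "DS_fastText_Transfer": {
            "Type": "Action",
            "ActionUrl": "https://actions.globus.org/transfer/transfer",
            "Parameters": {
                # Endpoint IDs and paths are supplied at runtime.
                "source_endpoint_id.$": "$.input.data_store_endpoint",
                "destination_endpoint_id.$": "$.input.compute_endpoint",
            },
            "ResultPath": "$.TransferResult",
            "Next": "fastText",
        },
        "fastText": {
            "Type": "Action",
            # Hypothetical LivePublication Action Provider (LPAP) URL.
            "ActionUrl": "https://example.org/lpap/fasttext",
            "ResultPath": "$.fastTextResult",
            "Next": "Statistics",
        },
        "Statistics": {
            "Type": "Action",
            "ActionUrl": "https://example.org/lpap/statistics",
            "ResultPath": "$.StatisticsResult",
            "End": True,
        },
    },
}
```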

Method / Code descriptions

Below, we experiment with generating code and method descriptions using the method files included within the Globus flow.

fastText Method

  • Model: GPT-4
  • Prompt: You will generate easy-to-understand natural language descriptions of these scripts. Primarily focus on the methods used / how we arrive at the outputs. For context - these scripts are run within a computational workflow, here is the definition of the workflow: WEP. When generating your descriptions, include a description of how this script fits within the context of the workflow. Script: script.

This script is used in the fastText state of the computational workflow.

FastText is a library for text classification and representation learning, created by Facebook's AI Research lab. This script loads a FastText model trained to classify text and then uses this model to predict the class of each line of an input text file.

Here's a step-by-step breakdown:

  1. The script begins by setting up necessary directories. The base directory (BASE_DIR) is set as the directory where this script is located. INPUT_DIR and OUTPUT_DIR are then defined as subdirectories of the base directory.
  2. In the main execution block (if __name__ == "__main__":), the script first opens an input file (sys.argv[1]) with read access using the codecs library, which allows the script to handle text encoded in UTF-8. The content of the file is read line by line into the lines variable.
  3. A pre-trained FastText model is loaded from a file named "model.lid.top800.epoch20.neg100.dim100.ns.small.min5.ftz". This model is trained to predict the class of a line of text.
  4. The script then iterates through each line in the input file again. For each line, any newline character (\n) at the end of the line is removed using rstrip("\n"). The cleaned line is then fed to the FastText model to generate a prediction. Each prediction is appended to the predictions list.
  5. Finally, all the predictions are written to a file in the output directory named fastText_predictions.txt. Each prediction is written on a new line, with the "__label__" prefix removed.

In the context of the workflow, this script follows the DS_fastText_Transfer state where the necessary files are transferred to the appropriate location. The output of this script, fastText_predictions.txt, is used in the final statistics state of the workflow to compute accuracy statistics.
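Putting the steps above together, a hedged reconstruction of the described script might look as follows. The subdirectory names, the clean_label helper, the argument guard, and the use of the fasttext package's load_model/predict API are assumptions drawn from the description above rather than from the original source file.

```python
import codecs
import os
import sys

def clean_label(raw):
    """Strip fastText's "__label__" prefix from a predicted label."""
    return raw.replace("__label__", "")

def main(input_path):
    # Directory setup as described; the subdirectory names are assumptions.
    base_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
    output_dir = os.path.join(base_dir, "output")

    # Imported lazily; requires the fasttext package and the model file.
    import fasttext
    model = fasttext.load_model(os.path.join(
        base_dir, "model.lid.top800.epoch20.neg100.dim100.ns.small.min5.ftz"))

    # Read the UTF-8 input file line by line via codecs.
    with codecs.open(input_path, "r", encoding="utf-8") as f:
        lines = f.readlines()

    # Predict a language label for each line, stripping the newline first.
    predictions = []
    for line in lines:
        labels, _probs = model.predict(line.rstrip("\n"))
        predictions.append(clean_label(labels[0]))

    # Write one cleaned prediction per line for the statistics step.
    with open(os.path.join(output_dir, "fastText_predictions.txt"), "w") as out:
        out.write("\n".join(predictions))

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

The resulting fastText_predictions.txt is exactly the artefact consumed by the statistics state described above.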