2023
This article presents an early look at how the LivePublication framework can enable live, updating components and generative content for computationally driven sciences. We present a language identification performance task comparing the accuracy of two methods, langdetect and fastText.
Welcome to our LivePublication demonstration! This website represents how we can incorporate real-time computation within the narrative structure of a research article.
This LivePublication is the product of an ongoing initiative to move beyond the limitations of traditional scientific publishing by integrating dynamic computational workflows within the fabric of a publication. The LivePublication framework aims to enable live, reactive publications while simultaneously enhancing transparency, repeatability, and collaborative scientific research.
To enable this, LivePublication integrates with distributed Globus flows by providing custom Action Providers (LivePublication Action Providers, LPAPs) to generate descriptive RO-Crates at each major computational step in a workflow/methodology. These distributed artefacts are then combined into a full description of the workflow execution, recording inputs, outputs, methods, and descriptive metadata. This primary artefact is then used as a data model for a publication, such as this website, to draw upon - providing updating figures, metrics, and other metadata. The structure, the recorded metadata, and the methods of integration with the publication are all early areas under active development.
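As a rough illustration, the `ro-crate-metadata.json` descriptor an LPAP emits for one step might resemble the minimal sketch below. The entity names and file paths here are hypothetical placeholders, not the actual LPAP output, which records far more detail.

```python
import json

# Minimal sketch of an RO-Crate metadata descriptor for a single workflow
# step. Entity names and paths are hypothetical; only the top-level shape
# (a JSON-LD @context plus a flat @graph of entities) follows the RO-Crate
# 1.1 specification.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "fastText language-identification step",
            "hasPart": [{"@id": "fastText_predictions.txt"}],
        },
        {
            "@id": "fastText_predictions.txt",
            "@type": "File",
            "description": "Predicted language labels, one per input line",
        },
    ],
}

print(json.dumps(crate, indent=2))
```

Combining one such descriptor per step yields the primary RO-Crate the publication draws on.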
Find a browseable version of the underpinning primary RO-Crate here!
This primary RO-Crate was generated through the execution of a Globus flow, described below in Figure 1:
While future LivePublication applications will primarily focus on how author-driven content can be realistically and seamlessly integrated with live updating articles, this article mostly uses generative content drawing on data exported from the underlying RO-Crate. Research on how authored and live content can be integrated is ongoing. Below, GPT-4 provides a short description of the results of this computational workflow, drawing on data generated during the flow.
The evaluation of the language identification methods - fastText and langdetect - reveals a nuanced performance profile contingent on the specific language being identified. Overall, the fastText model demonstrated superior performance with an overall accuracy of 98.6%, compared to langdetect's 97.91%. This comparison, however, does not capture the individual variances in accuracy across languages for the two models.
Delving into these language-specific performances, fastText exhibits impeccable accuracy in identifying several languages: German (deu), Greek (ell), English (eng), French (fra), Japanese (jpn), Thai (tha), and Chinese (zho) are all identified at 100% accuracy. Other languages such as Bulgarian (bul), Italian (ita), Russian (rus), and Vietnamese (vie) also show remarkable results, with accuracy close to 100%. fastText's weakest performance is observed for Swahili (swa) at 85.4% accuracy, indicating a potential area for model improvement.
On the other hand, langdetect also showcased impressive accuracy, with several languages reaching a 100% identification rate: Greek (ell), Japanese (jpn), Thai (tha), and Vietnamese (vie). It also performed notably well on Arabic (ara), German (deu), and Turkish (tur), with accuracy rates nearing 100%. Its lowest performance was observed for Dutch (nld) at 93.6%, signifying a potential area of focus for future model enhancements.
When comparing the two models on specific languages, fastText notably outperforms langdetect in identifying Bulgarian, English, French, Dutch, Polish, Portuguese, and Spanish. Both models demonstrate equivalent performance on Arabic, Greek, Japanese, Thai, and Vietnamese, while langdetect slightly surpasses fastText on Hindi and Urdu.
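The per-language accuracies compared above reduce to a simple aggregation over (true label, predicted label) pairs. The sketch below shows the computation; the sample data is illustrative only, not drawn from the actual test set.

```python
from collections import defaultdict

def accuracy_by_language(pairs):
    """Compute accuracy per true language from (true, predicted) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for true_lang, pred_lang in pairs:
        total[true_lang] += 1
        if pred_lang == true_lang:
            correct[true_lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# Illustrative sample, not the real evaluation data.
sample = [("eng", "eng"), ("eng", "eng"), ("nld", "eng"), ("nld", "nld")]
print(accuracy_by_language(sample))  # {'eng': 1.0, 'nld': 0.5}
```

The same per-language dictionaries, computed once per model, are what the figure below visualises.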
Fig. 2: fastText and langdetect method accuracy over test data, by language ID
Below is a very early attempt at generative content based on the generated Workflow Execution Plan (WEP). The WEP only provides a description of the Globus flow and includes no information regarding the actual execution of the method. Including details from the Workflow Execution Description (WED) would further enrich the description with details such as execution state (succeeded, failed), time taken per step, and other pertinent information.
The methodology to compare the performance of two language identification models, fastText and langdetect, involves the execution of a series of actions within a structured workflow. The actions range from data transfer operations to the execution of language identification models and statistical analysis of the generated results. The overall flow is organized as follows:
Data Transfer for fastText Model: The process begins by transferring the required dataset from the data store to a location accessible by the fastText model. This is done using Globus, a high-speed data transfer service. The specific parameters of this operation, such as the source and destination endpoint identifiers along with their respective paths, are supplied at runtime.
fastText Model Execution: With the data in place, the fastText language identification model is executed. The model reads the input data from the path specified in the previous step, performs language identification, and stores the result in the same location.
Result Transfer for fastText Model: The output of the fastText model, including statistics related to its performance, is then transferred back to the data store via another Globus transfer operation.
Data Transfer for langdetect Model: As in step 1, the required dataset is transferred from the data store to a location accessible by the langdetect model using a Globus transfer operation.
langdetect Model Execution: The langdetect model is then executed, processing the transferred data to perform language identification. The results of the model are stored in the designated location.
Result Transfer for langdetect Model: The results of the langdetect model, including its performance statistics, are then transferred back to the data store via a Globus transfer operation.
Statistical Analysis: With the results from both models in place, a statistical analysis is performed to compare their performances. The analysis includes accuracy statistics, and the generation of figures, tables, and other representations of the results for presentation in the publication layer.
Statistics Transfer to the Data Store: Finally, the results of the statistical analysis are transferred back to the data store using another Globus transfer operation.
Each of these operations is executed asynchronously, with designated wait times to ensure the completion of each task before moving to the next. The process has been designed for scalability and efficiency, with a focus on managing data and computational resources effectively.
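The seven steps above can be sketched as a chain of states in the Amazon-States-Language style that Globus flow definitions follow. The state names below (apart from `DS_fastText_Transfer`, which appears in the WEP) are hypothetical placeholders, and the real definition carries action URLs and parameters omitted here.

```python
# Hypothetical sketch of the flow's state machine. Only the linear ordering
# of the seven steps is taken from the article; names and structure beyond
# that are illustrative placeholders.
flow_definition = {
    "StartAt": "DS_fastText_Transfer",
    "States": {
        "DS_fastText_Transfer": {"Type": "Action", "Next": "fastText"},
        "fastText": {"Type": "Action", "Next": "fastText_DS_Transfer"},
        "fastText_DS_Transfer": {"Type": "Action", "Next": "DS_langdetect_Transfer"},
        "DS_langdetect_Transfer": {"Type": "Action", "Next": "langdetect"},
        "langdetect": {"Type": "Action", "Next": "langdetect_DS_Transfer"},
        "langdetect_DS_Transfer": {"Type": "Action", "Next": "statistics"},
        "statistics": {"Type": "Action", "Next": "statistics_DS_Transfer"},
        "statistics_DS_Transfer": {"Type": "Action", "End": True},
    },
}

# Walk the chain to confirm every state is reachable and the flow terminates.
state = flow_definition["StartAt"]
order = []
while True:
    order.append(state)
    spec = flow_definition["States"][state]
    if spec.get("End"):
        break
    state = spec["Next"]
print(" -> ".join(order))
```

Each `Action` state in the real flow invokes an Action Provider (a Globus transfer, a model execution, or the statistics step) and waits for it to complete before following `Next`.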
Below, we experiment with generating code/method descriptions using the method files included within the Globus flow.
This script is used in the fastText state of the computational workflow.
FastText is a library for text classification and representation learning, created by Facebook's AI Research lab. This script loads a FastText model trained to classify text and then uses this model to predict the class of each line of an input text file.
Here's a step-by-step breakdown:
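The actual method file ships in the primary RO-Crate; as a minimal sketch of the same pattern, a script of this shape loads a fastText model and labels each input line. The model and file paths are hypothetical placeholders (the workflow supplies the real ones at runtime), and only `load_model`/`predict` reflect the real fastText Python API.

```python
def parse_label(raw_label):
    """fastText returns labels like '__label__en'; strip the prefix."""
    return raw_label.replace("__label__", "", 1)

def predict_file(model_path, input_path, output_path):
    """Classify each line of input_path, writing one language code per line."""
    import fasttext  # imported lazily; requires the fasttext package
    model = fasttext.load_model(model_path)
    with open(input_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            # predict() returns a tuple of labels and their probabilities;
            # with the default k=1 we keep only the top label.
            labels, _probs = model.predict(line.strip())
            fout.write(parse_label(labels[0]) + "\n")

# Example (hypothetical paths; the workflow supplies real ones at runtime):
# predict_file("lid.176.bin", "input.txt", "fastText_predictions.txt")
```

The output file pairs line-for-line with the input, which is what makes the later per-language accuracy computation straightforward.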
In the context of the workflow, this script follows the DS_fastText_Transfer state where the necessary files are transferred to the appropriate location. The output of this script, fastText_predictions.txt, is used in the final statistics state of the workflow to compute accuracy statistics.