1. Introduction
Research on Text-To-Scene Conversion (TTSC) dates back to the 1980s (Adorni, Di Manzo, and Giunchiglia 1984). Most TTSC systems take a natural language text in English as their input, but TTSC systems have also been developed for other languages, such as Swedish (Johansson, Nugues, and Williams 2004), French (Kayser and Nouioua 2009), Chinese (Lu and Zhang 2002), Korean (Hong et al. 2018), Hindi (Jain et al. 2018), Japanese (Takahashi, Ramamonjisoa, and Ogata 2007), Russian (Ustalov and Kudryavtsev 2012), and Indonesian (Helfiandri, Zakhralativa Ruskanda, and Khodra 2020). To the best of the authors’ knowledge, no TTSC system for Persian natural language has previously been reported. The current study develops one of the initial steps of a Persian text-to-scene conversion system called PERSIS MEANS (PERSIan Story to MEaningfully ANimated Scene). The final PERSIS MEANS, which will be developed in future steps of the current study, will take an input story in Persian natural language text and produce a meaningful animation based on it. This study recognizes those tokens of the input story text that have a visualization in the final animation (i.e., the output of PERSIS MEANS) and then fills a conceptual scene model with these tokens as output. These conceptual scene models correspond to the scenes in the final animation and contain descriptions of the visual elements of those scenes. The proposed conceptual scene model and the method for filling it from the input text are the main contribution of the current study.
The difficulty of mapping the input natural language text to the formal representation required for visual scenes (like the conceptual scene model in the current study) is one of the main challenges in developing TTSC systems. This mapping task was done deterministically in the elementary stages of well-known TTSC systems, such as WordsEye [Footnote a] (Coyne and Sproat 2001) and SceneSeer [Footnote b] (Chang, Savva, and Manning 2014b; Zeng, Tan, and Ren 2016; Pardhi et al. 2021; Yadav, Sathe, and Chandak 2020), which focus on obtaining and correctly visualizing the spatial relations between the objects of a scene. The relatively simple and constrained structure of the input sentences in all of these systems enables their developers to map input tokens to the formal representation deterministically. The domain of the current study was literary realistic stories about the prophets. The story sentences focused not on objects or their properties but on the events occurring in the context of each prophet’s encounters with his tribe. In the first attempt at mapping these Persian stories to the conceptual scene model, deterministic rules based on the part-of-speech (POS) tags, word-sense disambiguation (WSD), and semantic role labeling (SRL) of the input text were designed. However, the structural challenges of Persian natural language, especially in literary stories of the above-mentioned domain, resulted in low accuracy in recognizing the elements of the conceptual scene model. Persian literary text usually forms long sentences. The average number of tokens per sentence in this study was 17.63. Most of the sentences had more than one verb, which means that almost all of the sentences contained one or more nested inner sentences. The average number of verbs per sentence was 2.34.
In many sentences, key semantic roles for the SRL were omitted because of syntactic or semantic symmetry, especially the roles that pointed to the actor of the verb or the thing that was acted upon (Shamsfard 2011). These challenges of Persian literary sentences prevented the deterministic development of the system in the first attempt.
WordsEye and CONFUCIUS have used advanced semantic processing levels (Ma 2006; Coyne et al. 2010) in which mapping takes place during the process (Jackendoff 1990). NLP for the Persian language has made acceptable progress in morphological and syntactic processing (Shamsfard, Jafari, and Ilbeygi 2010b), but semantic analysis of Persian text is in its initial stages and no off-the-shelf module is available (Shamsfard 2011). In recent years, some SRL corpora have been built in academic research labs. These corpora need to be checked by experts, and their accuracy needs to be improved. One of these corpora was used in the current study and is described below.
Other TTSC systems have developed the mapping through machine learning techniques. Glass and Bangay (2008) used hierarchical rule matching and generalization to identify semantic categories of concepts in fiction books. Rouhizadeh (2013) enriched the location information of VigNet, which is at the core of WordsEye, using crowd-sourced data. Chang, Savva, and Manning (2014a) expanded the deterministically extracted set of explicit spatial relations in SceneSeer with implicit spatial relations not specified in the text, using learned spatial priors. Chang et al. (2015) also used a machine learning approach for the lexical grounding problem. To map Persian natural language text to a conceptual scene model, the elements of the conceptual scene model were learned from story tokens through machine learning techniques in the current study. Proper preparation of the dataset and the choice of the model that best fits the problem at hand are the two success factors for solving a problem with machine learning models. Because there is no TTSC system in Persian to the authors’ knowledge, no such dataset exists.
To prepare the required dataset, the input text must be processed using NLP pipeline modules. The POS tagger and syntax analyzer modules are available in Persian, but no off-the-shelf module for semantic analysis and WSD is available (Shamsfard 2011). An SRL corpus containing only 7656 tokens is available (Mesgar et al. 2014). The limited number of sentences with SRL tags restricted the prepared dataset to 7946 tokens in 451 sentences. None of these tokens had been disambiguated against a lexicon; thus, the WSD tags were prepared manually. Sense granularity and the derivational and generative nature of the Persian language were among the challenges faced during the disambiguation of the dataset’s tokens. To recognize conceptual scene model elements from story tokens in a supervised manner, each token should be tagged with a class label giving the type of scene model element. These tags were also prepared manually. The dataset prepared in the current study, especially its WSD tags, can be used by other researchers to learn models for different tasks and to produce larger datasets. [Footnote c]
Using sentences to produce a conceptual scene model makes the nature of the problem sequential. Independent modeling of each scene element (as in traditional machine learning modeling) discards important information that exists in the sequence of tokens in a sentence. Therefore, in the current study, the conditional random field (CRF) model was selected for the sequence labeling task (Lafferty, McCallum, and Pereira 2001). The imbalanced nature of the collected data also motivated the use of CRF: the proportions of the different elements of the conceptual scene model in the prepared dataset varied widely. CRF can handle such imbalanced data and learn the elements that have only a small number of samples, as in named-entity recognition (NER) problem modeling (Sutton and McCallum 2012). As a second attempt at the mapping task at hand, learning with traditional non-sequential machine learning models was tested, resulting in 76.58% average accuracy. Using CRF as the third and last attempt, despite the limited number of samples in the dataset, resulted in acceptable accuracy; the average accuracy was 85.7%. Standard evaluation metrics of labeling are provided in detail.
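To make the sequential framing concrete, the following minimal Python sketch shows how each token of a sentence could be turned into a feature dictionary that also looks at its neighbouring tokens, a per-token input format accepted by linear-chain CRF toolkits such as sklearn-crfsuite. The field names (`form`, `pos`, `super_wsd`, `srl`), the one-token context window, and the toy tag values are illustrative assumptions, not the feature set actually used in this study. The CRF itself additionally models dependencies between adjacent labels, which is not shown here.

```python
from typing import Dict, List

# Hypothetical per-token record; the keys are assumptions for illustration.
Token = Dict[str, str]


def token_features(sentence: List[Token], i: int) -> Dict[str, str]:
    """Build a feature dictionary for token i, including its neighbours."""
    tok = sentence[i]
    feats = {
        "form": tok["form"],
        "pos": tok["pos"],
        "super_wsd": tok["super_wsd"],
        "srl": tok["srl"],
    }
    # Context features carry information from the surrounding tokens.
    if i > 0:
        feats["prev_pos"] = sentence[i - 1]["pos"]
        feats["prev_srl"] = sentence[i - 1]["srl"]
    else:
        feats["BOS"] = "1"  # beginning of sentence
    if i < len(sentence) - 1:
        feats["next_pos"] = sentence[i + 1]["pos"]
        feats["next_srl"] = sentence[i + 1]["srl"]
    else:
        feats["EOS"] = "1"  # end of sentence
    return feats


def sentence_features(sentence: List[Token]) -> List[Dict[str, str]]:
    """One feature dictionary per token, in sentence order."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Toy sentence with English glosses and made-up tag values.
toy_sentence = [
    {"form": "Zakariya", "pos": "N", "super_wsd": "person", "srl": "Arg0"},
    {"form": "noticed", "pos": "V", "super_wsd": "act", "srl": "V"},
    {"form": "food", "pos": "N", "super_wsd": "food", "srl": "Arg1"},
]
toy_features = sentence_features(toy_sentence)
```

A sequence of such dictionaries, paired with the sequence of scene element labels for the same sentence, forms one training example for a sequence labeler.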
The main contributions of this study are the development of the first TTSC system with Persian natural language text as input; the modeling of the problem of mapping Persian natural language text to the formal representation required for visual scenes as a sequence-tagging problem, together with the learning of a CRF model for this task; the preparation of a dataset for the mapping task; and the handling of the lack of the off-the-shelf NLP modules required to prepare the dataset.
Section 2 presents works related to the current study in detail. Section 3 introduces the elements of the conceptual scene model, as proposed in the current study. Section 4 provides detailed information about the process of dataset collection. The machine learning model used to label the visual elements of the conceptual scene model is addressed in Section 5. Section 6 provides and discusses the evaluation results of labeling visual scene elements in the story text, and Section 7 presents conclusions about the mapping problem at hand.
2. Related work
The design methodology and language understanding approaches used for mapping the input natural language text to a formal representation in TTSC systems are reviewed in this section. These design methodologies can be classified as deterministic or automatic. The automatic approaches can further be classified as rule-based or data-driven (Hassani and Lee 2016). The data-driven approach was applied in the current study by learning a CRF model based on the collected dataset. Both syntactic and semantic analyses of the input text were used to produce the dataset required for the mapping task.
WordsEye is an early and frequently cited TTSC system that converts text to a static 3D scene (Coyne and Sproat 2001). Its input sentences describe the position, color, size, and distance of objects in simple words, such as “a dog is on the table.” The relatively limited structure of this input text enabled the developers to use deterministic rules for mapping. WordsEye consists of two components: a linguistic analysis component and a scene depictor. The linguistic analysis component parses the input text and constructs a dependency structure. This structure is then used to construct a semantic representation deterministically. This semantic representation is a formal representation in which objects, actions, and relations are represented in terms of semantic frames (Fillmore 1982). The component uses a POS tagger and associates the words that have noun POS tags with 3D objects. It applies a set of predefined spatial patterns based on the dependency structure to capture the spatial relations. The words with verb POS tags are associated with a set of parametrized functions.
The WordsEye developers focus on correctly visualizing the senses of the words; thus, they have advanced the semantic processing of the input text in recent developments. They have incorporated lexical, semantic, and contextual knowledge to propose the Scenario-Based Lexical Knowledge Resource (SBLR) (Coyne et al. 2010). SBLR is a lexical knowledge base customized to represent the lexical and common-sense knowledge needed for TTSC purposes and is derived from WordNet (Miller 1995) and FrameNet (Ruppenhofer et al. 2010). The SBLR developers have augmented the derived lexical semantic information to include the finer-grained relations and properties of entities required to depict scenes and to capture the different senses of properties related to those properties and relations (Coyne et al. 2010). The researchers used both syntactic and semantic analysis of natural language text to visualize a scene correctly based on the input text. They integrated deterministic, semantic processing, and data-driven approaches to fulfill the mapping task.
CONFUCIUS is a text-to-animation conversion system that receives a natural language sentence in English and visualizes it as a 3D animation (Ma 2006). The lexical visual semantic representation (LVSR) was proposed for this system to represent the relationship between lingual and visual meanings. LVSR is based on the lexical conceptual structure (LCS) (Jackendoff 1990). The need for a method of mapping syntax onto semantics and vice versa is at the core of LCS. Jackendoff proposed a set of entities (conceptual primitives) for the conceptual structure and a sound foundation on which communication rules between syntax and semantics (these conceptual entities) could be built. LVSR adapted LCS for the purpose of language visualization and provided finer ontological categories of concepts for generating humanoid character animation. It also related arguments in the conceptual structure to arguments in syntax through a set of rules. The CONFUCIUS developers integrated advanced levels of semantic processing and rule-based techniques to solve the mapping problem.
Glass and Bangay (2008) converted natural language fiction books to a 3D animated virtual environment. Initially, the natural language text in English was converted to an intermediate representation, and then this intermediate representation was converted to a populated 3D environment. The intermediate representation consists of the original text annotated with different categories of concepts (Glass and Bangay 2009). A hierarchical rule-based learning system was created to learn the patterns that were used to create the annotations. The patterns are tree structures that abstract the input text according to structural (token, phrase, sentence) and syntactic (parts-of-speech, syntactic function) categories. The authors stated that a supervisor must slightly manipulate such annotated text before it is converted to animation (Glass and Bangay 2009). In this research, a semi-automatic rule-based learning technique was applied to the mapping problem. No semantic analysis was used to understand language.
SceneSeer is an interactive text-to-3D scene generation system that allows users to design 3D scenes using natural language (Chang et al. 2014b). The developers parsed the textual description of a scene and deterministically converted it to a scene template to generate the 3D scene described by the input text. This scene template includes a set of constraints on the objects present and the explicit spatial relations between them. For each object $o_{i}$, properties such as category label, color, material, and the number of occurrences in the scene are identified based on the phrase in which the object is mentioned. For this rule-based approach, the sentences of the text are first syntactically analyzed using the Stanford CoreNLP pipeline. [Footnote d] Headwords of noun phrases were identified as candidate objects. The nouns were filtered using WordNet so that only physical objects (excluding locations) were included. Coreference resolution was done using the Stanford Coreference system. To extract the properties of each object, the developers assessed adjectives and other nouns in the noun phrases containing the name of that object. The explicit spatial relations between objects were extracted using dependency patterns.
The developers of SceneSeer then improved on the TTSC system and expanded this deterministically extracted set of explicit spatial relations to include implicit spatial relations not specified in the text using learned spatial priors (Chang et al. 2014a). For this purpose, they collected a set of texts describing spatial relations between two objects in 3D scenes by running an experiment on Amazon’s Mechanical Turk (AMT) (Fort, Adda, and Cohen 2011). AMT is an online crowd-sourcing framework for data collection using human intelligence tasks (HITs). As an interactive text-to-3D scene generation system, the users of SceneSeer use textual commands to interactively refine the automatically created scene by adding, removing, replacing, and manipulating objects (Chang et al. 2017). The SceneSeer developers generally used deterministic rules to map input natural language text to the scene template and then used data-driven techniques to enrich the scene template. Only syntactic analysis of natural language text was used in their study for the mapping task, and their focus was on spatial information of the scene.
The information comparing these TTSC systems is shown in Table 1. Table 2 also shows the different types of information used in each of these systems. A syntactic analysis of the input text and WSD were used in all of these TTSC systems, but the semantic analysis was used in only some of them, such as WordsEye and CONFUCIUS.
Sequential modeling of sentences through CRF has been used in NLP studies on the Persian language, as it has in studies on the English language. Arian and Sabbagh (2017) applied CRF to learn semantic role labeling of Persian sentences with acceptable success.
3. Elements of conceptual scene model
Miaoulis and Plemenos (2009) stated that conceptual scene models are scene models that have been modeled in a declarative expression in natural language documents or based on a semantic network structure. Such models are generic and result in the creation of a group of less abstract models. They can be converted to a 3D visualization of the final scene by adding spatial arrangements and determining 3D models of the scene elements. The elements of the conceptual scene model to which the input text is mapped in the current study are introduced in Table 3.
This conceptual scene model consists of two main parts. The first defines the specifications of the elements present in the scene, and the second contains the general specifications of the scene, such as the location in which and the time at which the scene occurs. In the first part, the main scene elements can take three forms: ROLE, ANIMATED-OBJECT, and STATIC-OBJECT. ROLE represents the human characters, which are among the most important elements of the scene. Each human character in the input text is mapped to a ROLE in the scene model. The non-human elements of the story are objects. If the objects are animated, in the sense that they can perform an action, they are categorized as ANIMATED-OBJECTs. Animals are the most typical ANIMATED-OBJECTs and can perform acts such as running, sitting, and eating. Non-animated objects fall into the category of STATIC-OBJECTs.
Each of these three main scene elements can have more specific properties, which should be modeled in the conceptual scene model. For example, the different states of each object (ANIMATED/STATIC) are modeled as the ANIMATED-OBJECT-STATEs/STATIC-OBJECT-STATEs. The actions performed by each ANIMATED-OBJECT/ROLE based on the input story are represented by OBJECT-ACTION/ROLE-ACTION. The state and intent of a character in a scene are represented by ROLE-STATE and ROLE-INTENT, respectively.
Although the ROLE-STATE and ROLE-INTENT properties of a ROLE are similar, they have different applications in the scene model. The ROLE-STATEs of a ROLE are usually reflected directly in the visualization, unlike ROLE-INTENTs, which usually are not reflected directly in the scene and instead have an implicit effect on other scene elements. For example, the word (“slander”) in the phrase (“to slander”), the word (“introduction”) in the phrase (“introduction ceremony”), and the word (“talk”) in the phrase (“to talk”) have no direct visualization, but they are reflected in the state of the ROLE described by these words (mapped to ROLE-INTENTs). In all three cases, the ROLE performs the “talking” ROLE-ACTION, but in the case of (“slander”) the ROLE probably talks with an angry face, in the case of (“introduction”) probably talks with courtesy or enthusiasm, and in the case of (“talk”) probably talks with no special state. When a token maps to ROLE-INTENT, it can be used to infer ROLE-STATEs in the future steps of this study to produce meaningful animation. Another distinction between ROLE-STATE and ROLE-INTENT is that the existence of a ROLE-INTENT in a scene of a story affects the ROLE-STATEs and other scene elements of the current scene and also of the following scenes, whereas the presence of a ROLE-STATE in a scene does not have such a long-lasting effect.
The second part of the conceptual scene model includes general specifications such as LOCATION and TIME of the current scene. Each scene (plan) of a story occurs in one specific location according to the division of the story into scenes.
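The two-part structure described above can be sketched as a small set of Python data classes. This is only an illustrative data structure under stated assumptions: the class and field names are hypothetical, and attaching actions, states, and intents to the element that owns them is a design choice made for the sketch, not the representation used in the study.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Role:
    """A human character (ROLE) with its properties."""
    name: str
    actions: List[str] = field(default_factory=list)  # ROLE-ACTION
    states: List[str] = field(default_factory=list)   # ROLE-STATE
    intents: List[str] = field(default_factory=list)  # ROLE-INTENT


@dataclass
class AnimatedObject:
    """A non-human element that can perform actions (ANIMATED-OBJECT)."""
    name: str
    actions: List[str] = field(default_factory=list)  # OBJECT-ACTION
    states: List[str] = field(default_factory=list)   # ANIMATED-OBJECT-STATE


@dataclass
class StaticObject:
    """A non-animated element (STATIC-OBJECT)."""
    name: str
    states: List[str] = field(default_factory=list)   # STATIC-OBJECT-STATE


@dataclass
class SceneModel:
    # Part 1: elements present in the scene.
    roles: List[Role] = field(default_factory=list)
    animated_objects: List[AnimatedObject] = field(default_factory=list)
    static_objects: List[StaticObject] = field(default_factory=list)
    # Part 2: general specifications of the scene.
    location: List[str] = field(default_factory=list)  # LOCATION tokens
    time: List[str] = field(default_factory=list)      # TIME tokens


# Hypothetical scene: a character who talks with the intent to slander.
scene = SceneModel()
scene.roles.append(Role(name="character", actions=["talk"], intents=["slander"]))
scene.location.append("sanctuary")
```

Keeping ROLE-INTENT separate from ROLE-STATE in the structure preserves the distinction drawn above, so that a later step could infer states from intents.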
The authors tried to design the scene model elements from a screenwriter’s perspective. From this perspective, the scene elements can be divided into two separate parts: actors and non-actors. The actors are those elements that can perform an action in the scene (as well as change their state). The actors of a scene (plan) map to the ROLE and ANIMATED-OBJECT scene elements. The non-actor elements of a scene are modeled by STATIC-OBJECT. Both actor and non-actor scene elements can change their states in a scene. This modeling is rational from a screenwriter’s perspective and led to the 13 proposed scene elements.
Although the actors in the input story could be animals (modeled by ANIMATED-OBJECT), the domain of the selected corpus was literary realistic stories about the prophets. This means that the actors of all stories were human. Hence, the adequacy and coverage of the scene model for the stories with non-human actors were not tested. Selecting the literary stories to collect the dataset means that the adequacy of the proposed scene model for the stories with non-literary writings was not tested.
3.1 Comparison with scene models in other TTSC systems
Since WordsEye (Coyne and Sproat 2001) and SceneSeer (Chang et al. 2014b) focused on obtaining and correctly visualizing the spatial relations between the objects of a scene, the objects and their properties were the key elements of their scene models. However, the scene models in CONFUCIUS (Ma 2006) and in the work of Glass and Bangay (2008) tried to cover the different elements that might be present in a visual scene, so they are comparable to the scene model proposed in the current study. Table 4 indicates that the proposed scene model has many elements in common with the scene models in these systems.
The ROLE element in the proposed scene model is modeled with HUMAN in CONFUCIUS and with Avatar in Glass’s model. CONFUCIUS models the non-animated objects (STATIC-OBJECT) as OBJ, and humans, animals, plants, and all other animated objects as HUMAN. Glass makes no distinction between an ANIMATED-OBJECT and a STATIC-OBJECT. Both the ROLE-ACTIONs and the OBJECT-ACTIONs are modeled with EVENT in CONFUCIUS, but Glass only models ROLE-ACTION, with Transition. The STATE element in CONFUCIUS is a static situation, which does not involve changes and usually refers to a fact. The static properties of both HUMAN and OBJ are modeled with STATE, and their other, variable properties are modeled with PROPERTY. Glass models neither the features of objects nor those of humans. The LOCATION element is modeled with PATH and PLACE in CONFUCIUS and with Setting and Relation (an explicit description of a spatial relation) in Glass’s model. The Setting element in Glass’s model also covers the time of the scene, which is modeled with the TIME element in CONFUCIUS and in the proposed scene model.
Neither CONFUCIUS nor Glass distinguishes between the ROLE-STATE and ROLE-INTENT elements. CONFUCIUS has the AMOUNT element, which specifies the quantity of OBJs and has no equivalent in either the proposed model or Glass’s model.
3.2 A sample scene model mapped from a Persian sentence
Figure 1 shows two snapshots of a sequence from the film “Saint Mary” [Footnote e] that visualizes the sentence (“Each time Zakariya visited her, he noticed that she was provided with special food in her sanctuary, which caused him to wonder”). In the mapping of this sentence to the proposed scene model, (“Zakariya”) and (“she”), which refers to “Saint Mary,” are mapped to ROLE; the actions (“was going”) and (“noticed”) are mapped to ROLE-ACTION; and the states (“visit”), (“notice”), and (“wonder”) are mapped to ROLE-STATE. The token (“food”) is modeled as a STATIC-OBJECT with (“special”) as its STATIC-OBJECT-STATE. This STATIC-OBJECT-STATE is reflected in the visual scene by non-seasonal fruits, which are bright and shiny. The tokens (“near”) and (“sanctuary”) are mapped to the LOCATION of the event.
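The mapping of this sample sentence can be written down as token and label pairs and grouped by scene element, as in the following minimal Python sketch. The pairs use the English glosses given above in place of the Persian tokens; the function name `fill_scene_model` and the dictionary output format are assumptions made for illustration, and tokens carrying the non-visual JUNK and NO labels (Section 4.1.2) are simply skipped.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (English gloss of the token, scene element label) as described in the text.
labelled_tokens: List[Tuple[str, str]] = [
    ("Zakariya", "ROLE"),
    ("she", "ROLE"),
    ("was going", "ROLE-ACTION"),
    ("noticed", "ROLE-ACTION"),
    ("visit", "ROLE-STATE"),
    ("notice", "ROLE-STATE"),
    ("wonder", "ROLE-STATE"),
    ("food", "STATIC-OBJECT"),
    ("special", "STATIC-OBJECT-STATE"),
    ("near", "LOCATION"),
    ("sanctuary", "LOCATION"),
]

# Labels that have no visualization in the scene.
NON_VISUAL_LABELS = {"JUNK", "NO"}


def fill_scene_model(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group visual tokens by the scene element they are mapped to."""
    scene: Dict[str, List[str]] = defaultdict(list)
    for token, label in pairs:
        if label not in NON_VISUAL_LABELS:
            scene[label].append(token)
    return dict(scene)


sample_scene = fill_scene_model(labelled_tokens)
```

In a full system the labels would come from the learned tagger rather than from a hand-written list.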
4. Collecting the dataset
In the current study, learning the elements of the conceptual scene model from story tokens was done to map the Persian natural language text to the conceptual scene model. To learn this mapping, a dataset was required which contains the proposed scene model elements mapped to its tokens. The assignment of WSD tags to dataset tokens was also required, as in many TTSC systems (Table 2). To the authors’ knowledge, this required dataset does not exist in the Persian language, so it was collected. To collect the dataset, apart from the tokens of the sentences and their mapped conceptual scene model elements (labels), other information about each token is required.
To prepare these types of information, off-the-shelf Persian NLP modules were required. A syntax analyzer module was available (Shamsfard et al. 2010b), but there was no semantic analyzer module in Persian (Shamsfard 2011); only SRL corpora produced in academic research labs were available. Although these corpora were in the development phase and needed to be checked by experts to improve their accuracy, they comprised all of the available data containing semantic analysis information in the Persian language. The sentences in all of these corpora (except one) had general subjects and were collected from news and general texts. Literary real stories about the prophets were the subject of only one of these corpora (Mesgar et al. 2014), which was developed as part of the Qur’anic Question and Answer Project (Iran Telecommunication Research Center 2014) by the Iran Telecommunication Research Center [Footnote f] (ITRC). The proposed conceptual scene model elements were designed to model the visualization information of a scene and will be used to produce meaningful animation in this ongoing study. The interdependent sentences of the stories, which will be converted into interdependent visual scenes, were preferred for future steps of the current study over independent sentences with general subjects; thus, the ITRC SRL corpus was selected for dataset production purposes.
The ITRC corpus annotated each story token with syntactic and semantic information. The syntactic and semantic analysis methods used in the development of this corpus have been published as the Syntactic Manual of Style (Qur’anic Question and Answer Project 2014b) and the Semantic Manual of Style (Qur’anic Question and Answer Project 2014a). The Syntactic Manual of Style was developed using dependency grammar in the Persian language (Tabibzadeh 2006), and the Semantic Manual of Style was adapted from PropBank 3.0 (Palmer, Gildea, and Kingsbury 2005) and adjusted to the Persian language.
To prepare the intended dataset, WSD tags and mapped scene elements (label tags that the model must learn) should be added to the ITRC corpus. To the authors’ knowledge, no WSD tagged corpus was available in Persian, but these tags must be included in the required dataset. Table 2 shows that all TTSC systems have used a WSD tag in the mapping task. The significant role of the WSD information in both the first and second attempts at solving the mapping problem (designing deterministic mapping rules and learning traditional non-sequential machine learning models, especially rule-based models) confirmed the necessity of the WSD tag. To produce the dataset, the WSD tag and mapped scene model element (label tag) of each token were annotated manually.
4.1 Manual annotation task
In the annotation process, the ITRC corpus containing 7656 tokens in 451 sentences was given to the first annotator. She added the WSD tag and the scene model element mapped to each token of a sentence. Then, two other annotators checked and corrected the output of the first annotator to minimize the probability of errors. In the corpus provided by ITRC, each token of the stories was annotated in CoNLL 2008 format (Surdeanu et al. 2008). Not all CoNLL 2008 tags were intended for inclusion in the dataset. The unwanted tags of each token were eliminated from the corpus, and the remaining tags were used by the annotators. Table 5 shows the tags prepared for the annotators.
Figure 2 shows the annotation interface used by the annotators. Columns 1–5 of Figure 2 are the same as rows 1–5 of Table 5. The 6th and 7th rows of Table 5 are repeated once for each verb in a sentence and form the final columns of Figure 2 for that sentence. These columns were in a text file that was given to the annotators (as the annotation interface) so that they could manually fill in the required columns.
4.1.1 WSD annotation and its challenges
The lexicon used for WSD tagging was FarsNet 1.0 [Footnote g] (Shamsfard et al. 2010a). FarsNet is designed to include a Persian WordNet containing about 17,000 synsets in the first phase. The inner language relations established between the senses and synsets of FarsNet are the same as those in WordNet 2.1. This lexicon was developed by the Information Technology Faculty of the Cyberspace Research Institute [Footnote h] in cooperation with Shahid Beheshti University. [Footnote i]
The manual annotation of the WSD tag in Persian stories presents several challenges (Shamsfard 2011). Because of the sense granularity in FarsNet, selecting the appropriate sense requires great attention and is therefore a time-consuming task. The average numbers of words per synset and senses per word are 1.78 and 1.37, respectively. [Footnote j] Emerging words built by concatenating words and affixes are frequent in Persian due to its derivational and generative nature. These words do not exist in FarsNet, and they increased the number of missing values in the WSD annotation. Nearly 60% of the 7656 tokens of the corpus had no equivalent sense: about 50% of the tokens are stop words, and another 10% are non-stop words whose WSD tag is missing in FarsNet. As described in Section 5, unlike in common computational linguistic systems, the stop words were not removed from the corpus. To limit the set of WSD tag values in the prepared dataset, principal concepts were selected from the top of the concept hierarchy in FarsNet. These principal concepts are shown in Table 6. During manual annotation, the annotators disambiguated each token by determining its equivalent sense in FarsNet (WSD tag) and its hypernym word from the principal concepts list (super-WSD tag).
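The relation between a WSD tag and its super-WSD tag can be illustrated by walking up a hypernym chain until a principal concept is reached. The Python sketch below uses a hand-made toy hierarchy and a placeholder set of principal concepts; neither is taken from FarsNet or Table 6, and in the study this step was carried out manually by the annotators rather than by code.

```python
from typing import Dict, Optional, Set

# Hypothetical toy hierarchy: concept -> its direct hypernym.
HYPERNYM: Dict[str, str] = {
    "apple": "fruit",
    "fruit": "food",
    "prophet": "person",
}

# Placeholder principal concepts (not the actual list in Table 6).
PRINCIPAL_CONCEPTS: Set[str] = {"food", "person"}


def super_wsd(sense: Optional[str]) -> Optional[str]:
    """Return the first principal concept on the hypernym chain of `sense`.

    Returns None when the token has no sense (a missing WSD value) or when
    no principal concept is found above it.
    """
    seen: Set[str] = set()
    current = sense
    while current is not None and current not in seen:
        if current in PRINCIPAL_CONCEPTS:
            return current
        seen.add(current)  # guard against cycles in the toy data
        current = HYPERNYM.get(current)
    return None
```

For instance, under this toy data `super_wsd("apple")` climbs from "apple" through "fruit" to the principal concept "food", while a token with no sense yields `None`, which corresponds to a missing value in the dataset.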
4.1.2 Scene elements annotation and its challenges
The last tag that should be manually added to the corpus to provide the information required for the desired dataset is the conceptual scene model element tag (label tag to be learned). It should be determined as the mapped scene element of each token from the list: ROLE, ROLE-ACTION, ROLE-STATE, ROLE-INTENT, ANIMATED-OBJECT, ANIMATED-OBJECT-STATE, OBJECT-ACTION, STATIC-OBJECT, STATIC-OBJECT-STATE, LOCATION, TIME, JUNK, and NO. The definition of these scene elements is provided in Table 3. To map each token of a sentence to the proper scene element, the definition of each scene element was given to the annotators, and they were told to imagine the visual scene that is related to each sentence of every scene of each story. The annotators then mapped every token of each sentence to a scene element label according to their imagined visualization.
The JUNK label was designed for tokens that usually belong to the stop word list. The NO label was designed for tokens that are not visualized in the converted visual scene. For example, the tokens translated as “knew,” “interpretation,” and “dream” in the sentence “He knew the interpretation of the dream.” have no corresponding visualization elements; thus, they were mapped to the NO label. Although tokens with a JUNK label usually have no visualization either, separating them from the tokens with a NO label was reasonable and improved the accuracy of the results. Tokens with a NO label are at times referential nouns that head noun phrases, whereas JUNK tokens are not.
The ITRC corpus consists of 12 stories about the prophets. Each story consists of multiple sentences. In order to annotate scene elements, the annotators separated the different scenes of each story. A scene is a plan or shot as specified by the standard terminology of filmmaking. According to this terminology, each script consists of several sequences and each sequence is composed of different plans (Nazari 2006). A plan is one take, that is, the film recorded in the interval between the moment the camera is turned on and the moment it is turned off. A plan is the smallest part of a film and is structurally equivalent to a single word in a text. A sequence is a thematic grouping of events in a movie, such as a firing or wedding sequence. Each plan or scene occurs in one specific location, so the criterion for separating the scenes of a story is a change in the location in which the scene occurs. Figure 3 shows the completed corpus, in which the WSD, super-WSD, and mapped scene model elements are added to each token of the sentences in columns 6–8, respectively, and the different scenes of all stories are marked. A column is added to the left of the figure to show the English translation of each token.
The manual annotation of the scene element tag faced some challenges. An inaccurate mental visualization of a sentence, or even a misunderstanding of the sentence, could cause incorrect tagging. Additionally, ambiguous sentences are likely to exist in each story. These are sentences that can have multiple interpretations or visualizations. Confusion between scene elements such as ROLE-STATE and ROLE-INTENT is more probable because of the similarity of their meanings. The annotators had to take great care to recognize the mapped scene element correctly in order to prevent inappropriate use of the NO label.
The LOCATION and TIME scene elements were two of the most difficult ones to assign. The ArgM-LOC and ArgM-TMP SRL tags of a sentence (if available) show the locative and temporal tokens in a sentence, respectively. These tags can mislead the mapping task because they sometimes denote an abstract (not physical) location or time of the event occurring in that sentence. The token translated as “in” in the sentence “He saw a man in his dream.” has the ArgM-LOC SRL tag, but this tag does not actually refer to a physical location for the event “dream.” The token translated as “when” in the phrase “when he arrived in town” has an ArgM-TMP SRL tag, but it does not refer to the actual time at which the “arrive” event occurred. The token denoting the time of a scene may have been present in one or more previous scenes, which could make the imagining and mapping task for the TIME tag even harder. The correct division of a story into its scenes by the annotators is another challenge, which affects the accuracy of the manual annotation process, especially for the LOCATION and TIME tags of a scene.
An additional problem exists when mapping the tokens of a story to scene element tags. The SRL corpus provided by the ITRC was produced as part of a research project and its accuracy must be improved. This corpus originally contained nearly 15,000 tokens. Nearly half of the sentences were wrongly labeled with SRL tags and had to be eliminated from further processing. This decreased the total number of usable tokens of the corpus to 7656 tokens in 451 sentences. These wrongly labeled sentences in the ITRC corpus were scattered throughout the corpus and their elimination meant that many scenes in every story missed some sentences. Although the elimination of these sentences did not harm other sentences, the integrity of the scene that missed some of its describing sentences decreased. This problem, in some cases, made the imagining and mapping task difficult but tolerable for the human annotators.
4.2 The completed corpus and dataset
The manual annotation of the 7656 tokens took nearly ten days for each annotator because finding the WSD tag of each token in FarsNet was time-consuming, and the mapping of tokens to scene elements was even more challenging. Table 7 shows an example of a sentence that was manually annotated and its translation into English (column 3). Columns 1–2, 4–6, and 10 are from the ITRC corpus. Columns 7–9 were added by the annotators. Column 7 is the WSD tag using the FarsNet lexicon (Shamsfard et al. 2010a). Column 8 is the hypernym of the token from the principal concepts list (Table 6), and column 9 is the scene element tag mapped to each token. The average pairwise inter-annotator agreement (IAA) score was 85.5%, 87.9%, and 75.6%, according to the kappa value, for columns 7–9, respectively. The IAA score is interpreted as substantial agreement if it is between 61.0% and 80.0%, and as almost perfect agreement if it is between 81.0% and 99.0%, based on what was reported in Landis and Koch (1977). The annotators had almost perfect agreement on the WSD and Super-WSD tags and substantial agreement on the scene element tags, which means it was more difficult for them to decide between the different scene elements of a scene. The “completed” corpus in the current study, especially its WSD tags, can be used by other researchers to learn models for different NLP tasks.
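The agreement figures above can be illustrated with a small sketch. The text reports an average pairwise kappa but does not say which kappa variant was used, so the code below assumes Cohen's kappa for each annotator pair; the function names and the toy labels are hypothetical. The interpretation bands follow Landis and Koch (1977).

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of tokens with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


def average_pairwise_kappa(annotations):
    """Mean kappa over all annotator pairs (assumed aggregation)."""
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("at least two annotators are required")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def landis_koch(kappa):
    """Verbal interpretation of a kappa value (Landis and Koch 1977)."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, name in bands:
        if kappa <= upper:
            return name
    return "almost perfect"


# Toy example with made-up scene element labels for four tokens.
annotator_1 = ["ROLE", "JUNK", "NO", "JUNK"]
annotator_2 = ["ROLE", "JUNK", "JUNK", "JUNK"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Under these bands, the reported scores of 0.855 and 0.879 fall in the "almost perfect" range and 0.756 in the "substantial" range.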
After the manual annotation of the corpus was completed, the dataset was produced from the completed corpus. The FORM, GPOS, DEPREL, WSD, Super-WSD, Scene-Element, and ARG tags of each token of the corpus (columns 2, 4, 5, 7, 8, 9, and 10) were selected for the dataset. Tokens having more than one ARG in column 10 were duplicated, so that each instance of the token carries exactly one ARG. The seven features of each token were written on one line, separated by a tab character, and the sentences were separated by a blank line. The different scenes of each story and the different stories, which were marked in the completed corpus, were not marked in the dataset. The dataset has 7946 tokens in 451 sentences taken from 12 stories. The number of tokens in the produced dataset is larger than the number of tokens in the completed corpus because of the duplication of the tokens having more than one SRL argument. Figure 4 shows the part of the dataset that was based on the part of the completed corpus shown in Table 7. The third and fourth tokens in the main sentence of Figure 4 both correspond to the third token of Table 7 because it has two SRL arguments.
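The dataset layout described above (seven tab-separated fields per token, one row per SRL argument, a blank line between sentences) can be sketched as follows. This is only an illustration: the dictionary keys, the `_` placeholder for tokens without an argument, and the sample tag values are assumptions, not the authors' actual file conventions.

```python
# Field order follows the text: FORM, GPOS, DEPREL, WSD, Super-WSD,
# Scene-Element, then ARG as the seventh field.
FIELDS = ("FORM", "GPOS", "DEPREL", "WSD", "SUPER_WSD", "SCENE_ELEMENT")


def expand_token(token):
    """Yield one seven-field row per SRL argument of the token."""
    # "_" is an assumed placeholder for tokens that carry no argument.
    args = token.get("ARGS") or ["_"]
    for arg in args:
        yield [token[field] for field in FIELDS] + [arg]


def to_dataset(sentences):
    """Serialize sentences: tab-separated rows, blank line between sentences."""
    blocks = []
    for sentence in sentences:
        rows = [row for token in sentence for row in expand_token(token)]
        blocks.append("\n".join("\t".join(row) for row in rows))
    return "\n\n".join(blocks) + "\n"


# Made-up tokens; the tag values are illustrative only.
example = [
    [
        {"FORM": "tok1", "GPOS": "N", "DEPREL": "SBJ", "WSD": "sense1",
         "SUPER_WSD": "concept1", "SCENE_ELEMENT": "ROLE",
         "ARGS": ["Arg0", "ArgM-LOC"]},
        {"FORM": "tok2", "GPOS": "V", "DEPREL": "ROOT", "WSD": "sense2",
         "SUPER_WSD": "concept2", "SCENE_ELEMENT": "ROLE-ACTION",
         "ARGS": []},
    ],
    [
        {"FORM": "tok3", "GPOS": "N", "DEPREL": "OBJ", "WSD": "sense3",
         "SUPER_WSD": "concept3", "SCENE_ELEMENT": "STATIC-OBJECT",
         "ARGS": ["Arg1"]},
    ],
]
dataset_text = to_dataset(example)
```

In this sketch the first token produces two rows, which mirrors how duplicating multi-argument tokens raises the row count from 7656 corpus tokens to 7946 dataset tokens.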
5. Conditional Random Fields
A CRF is a popular probabilistic method for structured prediction (Sutton and McCallum 2012). Structured prediction methods combine graphical modeling and classification. Like graphical models, they can compactly model multivariate data; like classification methods, they can perform predictions using large sets of input features. Traditional classification models predict only a single class variable, but graphical models are powerful in the sense that they can model many interdependent variables. The sequential relation between output variables is perhaps the simplest form of dependency between output variables in graphical models. CRF predicts an output vector $\textbf{y}= \{y_{0}, y_{1}, ..., y_{T}\}$ of random variables given an observed feature vector $\textbf{x}$ when these random variables depend on each other. In the mapping problem at hand, the input is a sentence of a story, which is a sequence of tokens $\textbf{x} = \{x_{0}, x_{1}, ..., x_{T}\}$. Each $x_{s}$ is the feature vector of the token at index s. The output is a sequence of scene model elements $\textbf{y} = \{y_{0}, y_{1}, ..., y_{T}\}$, where each $y_{s}$ is the scene element mapped to the token at index s of the input sequence (sentence).
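For reference, the conditional distribution of a standard linear-chain CRF, as given by Sutton and McCallum (2012), can be written in the notation above. The text does not state this formula explicitly. The feature functions $f_{k}$, their weights $\theta_{k}$, the number of features $K$, and the fixed start label $y_{-1}$ are generic symbols assumed here for illustration.

$$p(\textbf{y} \mid \textbf{x}) = \frac{1}{Z(\textbf{x})} \prod_{s=0}^{T} \exp\left( \sum_{k=1}^{K} \theta_{k} f_{k}(y_{s}, y_{s-1}, x_{s}) \right), \qquad Z(\textbf{x}) = \sum_{\textbf{y}'} \prod_{s=0}^{T} \exp\left( \sum_{k=1}^{K} \theta_{k} f_{k}(y'_{s}, y'_{s-1}, x_{s}) \right)$$

The normalizer $Z(\textbf{x})$ sums over all possible label sequences $\textbf{y}'$, so the model scores whole sequences of scene elements rather than individual tokens.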
Scene model elements mapped from sentence tokens are interdependent. If one token in a sentence maps to ROLE in its corresponding conceptual scene model, it is more likely that other tokens of that sentence will map to the properties of that ROLE, such as ROLE-ACTION, ROLE-STATE, or ROLE-INTENT. In Persian sentences, if a token is mapped to STATIC-OBJECT-STATE, it is very probable that the token before it in the sentence has been mapped to STATIC-OBJECT. This relation holds in reverse order in English sentences. The last token (excluding the punctuation mark) of most Persian sentences is the action that occurs in that sentence. Depending on the presence of a ROLE or an ANIMATED-OBJECT in the sentence, it will be denoted as a ROLE-ACTION or an OBJECT-ACTION. A sample of these interdependent scene elements is shown in Table 8. Ignoring these sequential relations among the tokens of a sentence can cause some important information about the tokens to be discarded. As a sequential labeling graphical model, CRF can find the globally best label sequence for all tokens of a sentence, which is better than labeling each token of a sentence one at a time (Lafferty et al. 2001).
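The difference between finding the globally best label sequence and labeling tokens one at a time can be illustrated with a minimal Viterbi decoder, the usual decoding algorithm for linear-chain models. The emission and transition scores below are invented for illustration and are not taken from the trained model.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[s][y]   : score of label y at position s
    transitions[p][y] : score of moving from label p to label y
    """
    if not emissions:
        return []
    # best[s][y] = score of the best path that ends in label y at position s
    best = [{y: emissions[0][y] for y in labels}]
    back = []
    for s in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[s][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


LABELS = ["STATIC-OBJECT", "STATIC-OBJECT-STATE", "ROLE-STATE"]

# Invented scores: the second token looks slightly more like ROLE-STATE
# when it is considered in isolation.
emissions = [
    {"STATIC-OBJECT": 1.0, "STATIC-OBJECT-STATE": 0.0, "ROLE-STATE": 0.0},
    {"STATIC-OBJECT": 0.0, "STATIC-OBJECT-STATE": 0.9, "ROLE-STATE": 1.0},
]
# Invented transition bonus: STATIC-OBJECT followed by STATIC-OBJECT-STATE.
transitions = {p: {y: 0.0 for y in LABELS} for p in LABELS}
transitions["STATIC-OBJECT"]["STATIC-OBJECT-STATE"] = 1.0

sequence_labels = viterbi(emissions, transitions, LABELS)
tokenwise_labels = [max(LABELS, key=lambda y: e[y]) for e in emissions]
```

With these scores, token-by-token labeling picks ROLE-STATE for the second token, whereas sequence decoding prefers STATIC-OBJECT-STATE because of the transition from STATIC-OBJECT.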
Another reason for choosing sequential modeling to solve the problem at hand is the imbalanced nature of the collected data and the ability of the CRF model to handle such data (Sutton and McCallum 2012). The proportions of the different elements of the conceptual scene model in the prepared dataset were diverse, as shown in Table 9. The proposed scene model consisted of 13 elements (including the JUNK and NO labels). The proportion of the JUNK label was 50%, the ROLE and NO elements together comprised approximately 25% of the dataset, and the other ten elements comprised the remaining 25%. To preserve the natural sequence of words in a sentence, none of the tokens belonging to the common stop word lists were eliminated from the dataset. These stop word tokens comprised half of the dataset and intensified the imbalance. These imbalanced statistics were, to some extent, due to the domain of the selected stories, which was real stories about encounters by each prophet with his tribe. ROLE was a frequent scene element in comparison with STATIC/ANIMATED-OBJECT or other scene elements in those stories. The CRF model can handle this imbalanced data and learn the elements with a small number of samples, as in NER (Finkel, Grenager, and Manning 2005) and POS tagging problems (Pandian and Geetha 2009). The increase in the average accuracy of the mapping task from 76.58% in the second attempt (non-sequential modeling) to 85.7% in the third attempt (sequential modeling) confirmed the ability of CRF to model sequential and imbalanced data.
CRF is a graphical model that uses the traditional features of each token of a sentence to predict the sequence of labels and also uses contextual information from neighboring tokens, such as their features or labels. It uses this contextual information to model the interdependencies among variables. Because of this property, the feature set of a CRF can be very large and can include the features and labels of one, two, three, or more tokens before and after the current token and their combinations. Nevertheless, an inefficient large feature set can extend the time required for training and can decrease accuracy. Selecting an efficient feature set is a critical success factor in CRF. The selection of the final feature set of the CRF for the mapping problem at hand is discussed in Section 6.2. CRFsuite software (Okazaki 2007) was used to learn the CRF model in the current study.
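The idea of contextual features can be sketched as a window-based feature extractor that writes items in the tab-separated text format read by the CRFsuite command-line tool (label first, then attributes, with colons and backslashes escaped). The window size, the attribute naming scheme, and the BOS/EOS markers are assumptions made for illustration; they are not the 15 features selected in this study.

```python
KEYS = ("FORM", "GPOS", "DEPREL", "ARG", "WSD", "SUPER_WSD")


def escape(value):
    """Escape backslashes and colons, which are special in CRFsuite attributes."""
    return value.replace("\\", "\\\\").replace(":", "\\:")


def token_features(sentence, i, window=1, keys=KEYS):
    """Attributes of token i plus those of its neighbors within the window."""
    feats = []
    for offset in range(-window, window + 1):
        j = i + offset
        if j < 0:
            feats.append(f"BOS[{offset}]")  # assumed sentence-start marker
        elif j >= len(sentence):
            feats.append(f"EOS[{offset}]")  # assumed sentence-end marker
        else:
            feats.extend(
                f"{key}[{offset}]={escape(sentence[j][key])}" for key in keys
            )
    return feats


def sentence_to_lines(sentence, window=1):
    """One line per token: the label, then its attributes, tab-separated."""
    return [
        "\t".join([token["SCENE_ELEMENT"]] + token_features(sentence, i, window))
        for i, token in enumerate(sentence)
    ]


# Made-up two-token sentence; all tag values are illustrative.
sentence = [
    {"FORM": "tok1", "GPOS": "N", "DEPREL": "SBJ", "ARG": "Arg0",
     "WSD": "sense1", "SUPER_WSD": "concept1", "SCENE_ELEMENT": "ROLE"},
    {"FORM": "tok2", "GPOS": "V", "DEPREL": "ROOT", "ARG": "_",
     "WSD": "sense2", "SUPER_WSD": "concept2", "SCENE_ELEMENT": "ROLE-ACTION"},
]
lines = sentence_to_lines(sentence)
```

A larger `window` adds attributes from more distant neighbors, which is the trade-off between context and feature set size described above.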
6. Results and evaluation
6.1 Dataset statistics
The collected dataset had 7946 tokens in 451 sentences. These 451 sentences formed 232 scenes in 12 stories. The story lengths varied from 350 to 1500 tokens. Eliminating the sentences with incorrect SRL tags in the ITRC corpus removed some sentences and decreased the story lengths. Because CRF is a sequential labeling model, a decrease in accuracy can be expected for stories with missing sentences (stories from which some sentences have been deleted). A total of 15.96% of the 7946 tokens of the dataset (1268 tokens in 66 sentences, 32 scenes, and 2 stories) were set aside as the test set. The distribution of each scene element across the training and test sets is shown in Table 9. The 3rd, 5th, and 6th columns show the number of each type of scene element in the collected dataset, the training set, and the test set, respectively. The last column shows the percentage of each scene element in the test set relative to its total number in the dataset. The split was chosen to guarantee that the minimum proportion of the less frequent scene elements in the test set was approximately 12% (with one exception). Ten percent of the training set was assigned to the validation set. Table 10 summarizes the dataset statistics.
Precision, recall, and F1-score were calculated for evaluation. These measures are widely used in text classification (Alpaydin 2014). For each scene element, treated as the positive class, they are calculated as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
TP = number of tokens correctly classified as positive.
FP = number of tokens incorrectly classified as positive.
TN = number of tokens correctly classified as negative.
FN = number of tokens incorrectly classified as negative.
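These measures can be computed per scene element in a one-vs-rest fashion, as in the minimal sketch below. It uses the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 as their harmonic mean). The unweighted averaging in `average_f1` is an assumption, since the text does not say how the average F1-score was aggregated, and the labels in the example are made up.

```python
def label_scores(gold, predicted, label):
    """Precision, recall, and F1 for one label treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_f1(gold, predicted, labels):
    """Unweighted mean F1 over the given labels (assumed aggregation)."""
    return sum(label_scores(gold, predicted, lab)[2] for lab in labels) / len(labels)


# Made-up gold and predicted scene elements for four tokens.
gold = ["ROLE", "ROLE", "NO", "JUNK"]
predicted = ["ROLE", "NO", "ROLE", "JUNK"]

role_scores = label_scores(gold, predicted, "ROLE")
# NO is left out of the average, mirroring its exclusion in the evaluation.
mean_f1 = average_f1(gold, predicted, ["ROLE", "JUNK"])
```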
6.2 CRF feature set
The final selected feature set of CRF in this study contains 15 features, as shown in Table 11. As stated in Section 5, CRF can use the features of the neighbor tokens as contextual features of a token to learn the interdependencies between tokens of a sentence. To select the best feature set for the problem at hand, the collected dataset feature set of the token itself comprising FORM, GPOS, DEPREL, ARG, WSD, Super-WSD, and Scene-Element was selected as the base. Then, the features of neighbor tokens were temporarily added to the feature set so that the average F1-score accuracy measure for all scene elements could be monitored. Whenever adding a neighbor feature increased the average F1-score, that neighbor feature was added to the selected feature set permanently. Otherwise, it was removed. The different features tested during this process and the resulting average F1-score for the feature set are listed in Table 12.
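The selection procedure described above is a greedy forward selection, which can be sketched as follows. The `evaluate` callback is a hypothetical stand-in for training a CRF with the given feature set and returning its average F1-score; the feature names and the scores in the toy example are invented.

```python
def greedy_forward_selection(base_features, candidates, evaluate):
    """Keep a candidate feature only if it raises the score.

    evaluate(features) is assumed to train a model with the given feature
    set and return its average F1-score.
    """
    selected = list(base_features)
    best_score = evaluate(selected)
    history = [(None, best_score, True)]
    for feature in candidates:
        score = evaluate(selected + [feature])  # add temporarily
        keep = score > best_score
        history.append((feature, score, keep))
        if keep:
            selected.append(feature)            # add permanently
            best_score = score
    return selected, best_score, history


# Toy stand-in for model training: each feature has an invented effect
# on the score, so the example runs without a CRF toolkit.
INVENTED_GAINS = {"GPOS[-1]": 0.02, "WSD[1]": -0.01, "ARG[-1]": 0.01}


def toy_evaluate(features):
    return 0.80 + sum(INVENTED_GAINS.get(f, 0.0) for f in features)


selected, best_score, history = greedy_forward_selection(
    ["FORM[0]"], ["GPOS[-1]", "WSD[1]", "ARG[-1]"], toy_evaluate
)
```

Because each candidate is tested once against the current set, the outcome of such a procedure can depend on the order in which the candidates are tried.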
6.3 Evaluation and discussion
The detailed accuracy measures with the selected feature set in Table 11 are provided in Table 13. As listed in Table 9, the ANIMATED-OBJECT-STATE and OBJECT-ACTION scene elements each had nearly 30 tokens in the total dataset; thus, the CRF could not learn them, so they were excluded from the evaluation. The average F1-score for the other ten scene elements (excluding NO as the unwanted label) was 85.7%, which is satisfactory.
Precision, recall, and F1-score measures for some scene elements, including ROLE, ROLE-ACTION, ANIMATED-OBJECT, LOCATION, TIME, and JUNK, were satisfactory. The accuracy measures for ROLE-STATE and ROLE-INTENT were lower than the others because of the difficulty of distinguishing states from intents for the ROLE samples. The confusion matrix shown in Figure 5 confirms this. These two classes (and also ROLE-ACTION) had the most tokens misclassified as NO. Nearly one-third of the ROLE-INTENT samples were misclassified as ROLE-STATE; hence, ROLE-INTENT’s recall and ROLE-STATE’s precision decreased. The misclassification of OBJECT-ACTION tokens as ROLE-ACTION decreased ROLE-ACTION’s precision significantly.
The lowest F1-score was for the STATIC-OBJECT-STATE scene element. STATIC-OBJECT-STATE and ANIMATED-OBJECT were both low-frequency scene elements with nearly 70 tokens each in the total dataset (nearly 0.9% of the dataset), but the ANIMATED-OBJECT scene element was learned very well. This means that the selected feature set contained enough information to distinguish the ANIMATED-OBJECT scene elements from the others. For STATIC-OBJECT-STATE, the small number of tokens in the training set prevented CRF from distinguishing it from the others. Only 2 of 12 actual STATIC-OBJECT-STATE tokens were correctly predicted, and nearly half of them were misrecognized as ROLE-STATE. However, CRF did not significantly confuse other scene elements with STATIC-OBJECT-STATE. In the case of STATIC-OBJECT, the tokens were misclassified as STATIC-OBJECT-STATE, LOCATION, and NO, which resulted in a reduced recall. The presence of STATIC-OBJECTs in the description of LOCATIONs is a reason for confusion, as in “at the bottom of the sea,” in which “sea” is annotated as a STATIC-OBJECT and “bottom” is annotated as LOCATION. Despite the low number of TIME tokens in the test set, only one of them was misclassified as NO, and some NO tokens were wrongly predicted as TIME, which slightly reduced the precision.
6.4 Feature ablation study to expand the dataset
The dataset collected in this study was limited to 7946 tokens because of the limited number of tokens in the SRL-annotated corpus. The WSD and Super-WSD tags were also annotated manually. The lack of more SRL-annotated sentences and the cost of manually annotating additional data are two barriers to expanding the dataset in the future steps of the study. A feature ablation study was conducted to understand the contribution of these two (and other) annotation types to system performance. Each of the main annotation types (FORM, GPOS, DEPREL, ARG, WSD, and Super-WSD), together with its related features, was removed one at a time, and the system performance was then measured. Table 14 shows that removing the token itself (represented by the FORM feature) and the Super-WSD feature reduced the average accuracy significantly. Conversely, removing the SRL feature (represented by ARG) caused the smallest reduction in accuracy. These results show that, to expand the dataset for the future steps of the study, the existence of SRL-tagged sentences is not critical, but the manual annotation of the WSD and Super-WSD tags of tokens cannot be skipped.
6.5 Comparison with non-sequential learning
As stated in Section 1, in the second attempt to map Persian natural language story text to conceptual scene model elements, non-sequential machine learning models were applied. These models predict the scene element mapped to each token of the story one at a time. In this phase, Decision Table (Kohavi 1995) and ZeroR models were learned as rule-based classifiers, and Decision Stump and J48 (Quinlan 1993) models were learned as decision trees using Weka 3 software (Frank, Hall, and Witten 2016). The SVM model was also tested. The dataset used in this phase was produced from the completed corpus and consisted of FORM, GPOS, DEPREL, ARG, WSD, Super-WSD, and Scene-Element, like the dataset used for learning the CRF model, with the difference that the tokens of all the sentences were concatenated. The division of the dataset into training and test sets was the same as that used for the CRF model. The accuracy measures of these non-sequential models and the corresponding accuracy measure of the CRF model are shown in Table 15. The best average accuracy with this feature set was 76.58%, which was obtained by the J48 model and was lower than the average accuracy of the CRF model (85.7%). The tokens with the JUNK label (nearly half of the dataset) were not removed in this dataset, although their removal is common in non-sequential modeling to prevent inflated accuracies. As a second test case, the tokens with a JUNK label were removed from the dataset, which significantly reduced the average accuracy of the best non-sequential model to 60.70%. This change could not be applied to the CRF model because the natural order of tokens in sentences had to be preserved. The imbalanced nature of the scene elements in the collected dataset, shown in Table 9, was one reason for the low performance of these non-sequential models.
A filter was applied to the dataset in the first test case (the case most comparable to the CRF dataset, that is, the dataset that includes JUNK tokens) to artificially balance the number of tokens of each scene element to approximately 600. Applying this filter reduced the accuracy measure of each scene element (excluding JUNK and NO) by at least 6% and at most 30%. The results are shown in the fifth column of Table 15. The significant decrease in the accuracy of all models confirms the natural imbalance of the scene elements mapped from the story tokens, which could not be handled by artificially balancing the dataset.
The detailed accuracy measures of the non-sequential model with the best results, J48, on the first test case dataset are shown in Table 16. The strengths and weaknesses of this model in predicting the mapping between tokens and scene elements are similar to those of CRF. ROLE, ROLE-ACTION, ANIMATED-OBJECT, LOCATION, TIME, and JUNK were the best scene elements learned by J48. All scene elements were learned better by CRF. The precision of STATIC-OBJECT-STATE and recall of ROLE-ACTION increased significantly for J48. Nearly all accuracy measures of STATIC-OBJECT decreased in comparison with CRF. This means that discarding sequential information about tokens, such as their position in the sentence and the scene elements of the neighboring tokens, caused misclassification of other scene elements as STATIC-OBJECT. Non-sequential learning did not recognize the tokens with STATIC-OBJECT-STATE and ROLE-INTENT labels and misclassified them as other scene elements.
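Artificial balancing of this kind can be sketched with simple random resampling: frequent labels are undersampled and rare labels are oversampled with replacement until each reaches the target count. The text does not name the filter that was used, so this is only one plausible way to balance the classes; the function name, the seed, and the toy rows are assumptions.

```python
import random
from collections import Counter, defaultdict


def balance_by_resampling(rows, label_of, target, seed=0):
    """Resample rows so that every label occurs exactly `target` times."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    balanced = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) >= target:
            # Frequent label: draw `target` rows without replacement.
            balanced.extend(rng.sample(group, target))
        else:
            # Rare label: keep all rows and add copies drawn with replacement.
            extra = rng.choices(group, k=target - len(group))
            balanced.extend(group + extra)
    rng.shuffle(balanced)
    return balanced


# Toy rows of (token, scene element); the counts are made up.
rows = [(f"j{i}", "JUNK") for i in range(10)] + [(f"r{i}", "ROLE") for i in range(3)]
balanced = balance_by_resampling(rows, label_of=lambda row: row[1], target=5)
counts = Counter(label for _, label in balanced)
```

Resampling individual tokens like this breaks the order of tokens within sentences, so it is only meaningful for non-sequential models that treat each token as an independent instance.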
7. Conclusion
The present study is part of PERSIS MEANS and maps Persian natural language text to a conceptual scene model aimed at generating meaningful animation based on an input story. The data-driven approach of learning a CRF model was used for the mapping task. To the best of the authors’ knowledge, the required dataset did not exist; thus, for this first TTSC system that converts Persian text, a dataset was collected for the task. This process faced challenges such as a lack of the required off-the-shelf NLP modules and a significant error rate in the output of the available modules or corpora. Some of the required information was available in a corpus with a limited number of tokens. Human annotators manually annotated the available corpus with the information required for the intended dataset. Evaluation of the results showed acceptable accuracy for the mapping task. The next stage of the research will enrich the conceptual scene model mapped from the input text in this stage with conceptual factors to enable the production of meaningful animation.
Acknowledgments
We acknowledge the Iran Telecommunication Research Center (ITRC), especially the Qur’anic Question and Answer Project team, for their great help and for providing the NLP modules required for this research. We would also like to express our gratitude to Mojgan Farhoodi, Ehsan Darrudi, Maryam Mesgar, and Meisam Ahmadi for their invaluable guidance and consultation.