Policy Significance Statement
This paper explores the transformative role of AI-powered language automation in managing online customer feedback, focusing on how AI tools can support the process of responding to reviews. By analyzing both human and AI-generated responses, the study offers insights into the evolving roles of businesses, feedback management providers, and AI systems in shaping customer interactions. The findings have critical implications for policy, particularly in the areas of transparency, information quality, process efficiency, and accountability in customer service. This research contributes to the broader discourse on AI governance, assisting policymakers and organizations in addressing the challenges and opportunities that arise from integrating intelligent tools into customer relationship management.
1. Introduction
The ongoing digitalization affects many areas of our lives, including customer relationships. Customers rely on online reviews from rating platforms, search engines, social networks, and company websites to make purchasing decisions, and the number of such reviews keeps growing (Statista, 2022). This phenomenon, known as electronic Word-of-Mouth (eWOM) (Hennig-Thurau et al., Reference Hennig-Thurau, Gwinner, Walsh and Gremler2004), directly impacts the success of a company (Ye et al., Reference Ye, Law and Gu2009), similar to traditional Word-of-Mouth. The hospitality industry is no exception. Although the hospitality industry, especially hotels and restaurants, is a typical “offline” business, many steps of the customer journey are carried out online: before the experience (collecting information, planning, booking) and afterward (exchange, evaluation). Hence, companies can no longer afford to ignore their online image (Hallmann et al., Reference Hallmann, Zehrer and Müller2015; Liu and Lee, Reference Liu and Lee2016).
Businesses attempt to benefit from eWOM and thus perform online customer feedback management (CFM). Representatives of the hospitality industry become active and use online contact points to interact with their customers by responding to online reviews (Deng et al., Reference Deng, Lee and Xie2021). Online reviews are accessible to Internet users and have value for both companies and customers. By responding to those reviews, businesses can improve existing customer relationships, regain lost customers, and attract new ones (Chevalier and Mayzlin, Reference Chevalier and Mayzlin2006; Ye et al., Reference Ye, Law and Gu2009). Customers benefit from directly contacting the company to provide feedback and address issues.
However, responding to customer feedback is a considerable challenge for businesses. A good response to an online review requires know-how from its author. In addition, maintaining online CFM demands personnel and time from businesses. While large companies have the necessary resources, smaller companies often rely on the services of external CFM providers. Such providers offer a variety of online CFM services, including centralized collection, analysis, and response to online reviews.
The response authors at external CFM providers or the responsible employees within the company face complex tasks. They must carefully read and analyze the reviews collected, tag relevant content for further evaluation, and craft responses that meet the quality requirements and company guidelines. The challenge lies in maintaining creativity and individualization in responses while avoiding fixed and repetitive phrases, especially given the increasing number of online reviews (Statista, 2022).
The latest artificial intelligence (AI) technologies, especially natural language processing (NLP) tools, can support the CFM process. The state-of-the-art NLP models can automate or at least augment the review evaluation through intelligent analysis of their texts. Text generation has the potential to facilitate the process of writing responses. However, how these technologies can be integrated into the complex process of online CFM and to what extent human authors’ role in this process is eliminated remain open questions. Therefore, this study addresses the following research question:
1.1. How can the process of responding to online customer feedback be augmented by intelligent tools?
We are pursuing the research question within an industry project, aiming to assist authors and businesses in responding to customer reviews. The project team developed an intelligent system that incorporates advanced NLP solutions. The integration of AI tools redefines human roles and working practices, fostering a new era of human–AI collaboration. The results concerning the research question provide valuable insight into how intelligent tools can enhance the process of responding to online customer feedback. Although full automation of the response process remains out of reach, the introduced Response Generator, combined with the Quality Score, plays a crucial role in shaping new authoring practices. However, as these tools are increasingly incorporated into CFM, their impact can go beyond operational performance.
We examine the designed process and how the various tools support the work of human authors. Based on our project partner’s data on the use of generative AI over the course of a year, we examine the division of labor between humans and the generative technologies in the loop. The integration of these AI tools raises important considerations about the potential policy implications of their use, particularly in areas such as transparency, authorship, and accountability.
The results of this study have implications for various stakeholders involved in online feedback management, including managers, authors, IT developers, and operators of online feedback platforms. We address key policy considerations for businesses and regulators that arise from the automation of the response process.
2. Related work
To investigate the potential of IT support for the creation of review responses, we analyzed the literature on NLP in CFM. For the design and organization of human–AI collaboration, we turn to the literature on engineering science, focusing on the control theory of dynamic systems.
2.1. NLP in CFM
Most online customer feedback comes in the form of unstructured text. As a research field concerned with the use of computers to understand and manipulate natural language (Chowdhary, Reference Chowdhary and Chowdhary2020), NLP and its ability to support human actors (Coenen et al., Reference Coenen, Davis, Ippolito, Reif and Yuan2021; Wiethof et al., Reference Wiethof, Tavanapour and Bittner2021) in the writing process or in the evaluation of texts has received considerable attention in recent years. To facilitate the tasks of CFM, we can make use of several NLP technologies such as text generation, sentiment analysis, content analysis, and named entity recognition. They are crucial for analyzing the content of online reviews and for generating high-quality responses to them effectively and efficiently. A conclusive analysis of a company’s online image can only be based on detailed data about the content of the review texts. This information covers both the sentiment and the aspects mentioned in a review. Given the volume of reviews, it is crucial to extract this information at different levels of granularity automatically. Thanks to the latest NLP research approaches, we can benefit from fine-grained aspect-based sentiment analysis (Tang et al., Reference Tang, Fu, Yao and Xu2019). The extracted data can be beneficial for subsequent response generation and final response quality evaluation.
Synchronous feedback communication channels enjoy the support of chatbots or conversational agents. Chatbots have been used in various contexts to collect and address customers’ feedback swiftly and cost-effectively (Adam et al., Reference Adam, Wessel and Benlian2021; Omisakin et al., Reference Omisakin, Bandara and Kularatne2020; Sidaoui et al., Reference Sidaoui, Jaakkola and Burton2020). These often AI-based systems communicate with human customers interactively via natural language. However, they often fail to meet customer expectations, mainly due to inconsistent messages to the customers (Adam et al., Reference Adam, Wessel and Benlian2021).
Combining human-like design with NLP technologies could promise a service that is always available and of high quality, very close to real customer service (Diederich et al., Reference Diederich, Janssen-Müller, Brendel and Morana2019). Innovative services of this kind are also referred to as “bots-as-a-services.” Unlike responses to short messages, generating a response to complex texts, which are also reviews, is a challenge. Such text generation is a long-standing research domain for NLP experts, and some solutions are available. Language models such as BERT (Devlin et al., Reference Devlin, Chang, Lee and Toutanova2019) and GPT (Floridi and Chiriatti, Reference Floridi and Chiriatti2020) can predict the appearance of subsequent text tokens based on an input text. These models enable AI systems to continue human-written text or even generate entire paragraphs. There are also attempts to generate review responses with the help of NLP (Katsiuba et al., Reference Katsiuba, Kew, Dolata and Schwabe2022).
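To make the underlying mechanism concrete, the following minimal sketch shows how a pretrained language model can continue an input text using the Hugging Face transformers library. The model choice (GPT-2) and the prompt are purely illustrative and do not correspond to the generator described later in this paper.

```python
# Minimal sketch: continuing a draft response with a pretrained language model.
# The model name and prompt are illustrative, not the generator used in this study.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Dear guest, thank you for your kind review of our restaurant."
result = generator(prompt, max_new_tokens=40, num_return_sequences=1)
print(result[0]["generated_text"])
```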
Individual NLP researchers propose partial solutions for review analysis or review response generation; however, how the high quality requirements for review responses can be fulfilled remains unsolved. For example, responses must exhibit some level of empathy and individualization. They must be responsive to personalized information from the review and offer business reactions in some circumstances. NLP technologies can handle various tasks, but their integration into a productive system for automation and the role of human actors within it remain unexplored.
2.2. Control theory
Control theory describes one approach for developing feedback systems. It deals with the control of dynamic systems in technical processes and machines. It aims at developing an algorithm or a process that controls the system depending on the system inputs and brings it to the desired state. The design minimizes errors and delays and achieves an optimal stable condition. For this purpose, it requires a sensor that measures the defined process variable (PV) and a controller with appropriate correction behavior. Figure 1 illustrates the components of control theory schematically. The controller constantly monitors the PV measured by the sensor and compares it with the reference or setpoint value (SP). The SP-PV error signal is the difference between the PV and the setpoint value. The controller uses this error as feedback to generate a control action that brings the controlled PV to the same value as the setpoint value. The described measurements and adjustments in the system are only possible if controllability and observability are given. These two properties form the basis for the automation potential of a process (Doyle et al., Reference Doyle, Francis and Tannenbaum2013).

Figure 1. The control theory components.
Control mechanisms can be implemented by creating an appropriate control loop. The theory distinguishes between open- and closed-loop controls. Open-loop control systems do not connect the control action of the controller with the “process output.” In closed-loop control systems, the control action depends on the system output. Such a system comprises a set of physical or software instruments that automatically bring a PV to the desired state or setpoint without human intervention. In this sense, closed loops contrast with open loops, which may require manual input.
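The following purely illustrative sketch (with made-up numbers) shows the basic closed-loop logic described above: the sensor measures the PV, the controller computes the SP-PV error, and a proportional control action drives the PV toward the setpoint.

```python
# Illustrative closed-loop (feedback) control: the controller repeatedly corrects
# the process variable (PV) until it approaches the setpoint (SP).
def run_closed_loop(setpoint: float, pv: float, gain: float = 0.5, steps: int = 20) -> float:
    for _ in range(steps):
        error = setpoint - pv          # SP-PV error signal measured via the sensor
        control_action = gain * error  # proportional correction by the controller
        pv += control_action           # the process reacts to the control action
    return pv

# Example: starting from PV = 40, the loop converges toward the setpoint of 85.
print(round(run_closed_loop(setpoint=85.0, pv=40.0), 2))
```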
Initially described in the 19th century, control theory finds applications in various disciplines such as engineering, sociology, linguistics, and dynamic programming. Control systems can be categorized according to the number of inputs and outputs. The simplest, single-input–single-output (SISO) systems have one input and one output, while the most complex are multiple-input–multiple-output (MIMO) systems. MIMO systems are commonly used in computer science to model highly complex systems, such as the human body, the climate, or nuclear reactors.
We argue that control theory offers a suitable framework for understanding and designing sociotechnical CFM processes, where tasks can be assigned to humans, machines, or combinations of both. The diversity of online reviews and the requirements for response quality make it currently impossible to omit human actors from the response process in CFM. Although integrated NLP-based tools can potentially accomplish some tasks of human authors, the final review responses still must be controlled by human authors, based on the feedback from the customers or the administrator. It results in a dynamic human–AI collaboration system that requires an effective and well-designed control mechanism for the feedback system. Applied to the CFM process, control theory provides guidance for the required components, the organization of the process steps, and the management of the feedback loop.
3. Methodology and data collection
3.1. Problems and solution objectives
We addressed the research questions in cooperation with the Swiss management-owned start-up re:spondelligent, hereafter referred to as the feedback management provider (FMP). The company launched the “Smart Responses” project in collaboration with two research partners specializing in computational linguistics (CL) and information management (IM), along with an IT company. FMP offers CFM services for the hospitality industry, including the collection and analysis of online reviews, as well as the writing of customized review responses. This study specifically focuses on FMP’s review answering service.
Initially, human authors were responsible for formulating review responses. To better understand the challenges in this process, we conducted a series of in-depth interviews with three authors and two FMP managers to identify key problems and explore potential solutions. The interview data was qualitatively analyzed and the findings were further refined in a workshop with the FMP administrators to clarify the solution objectives and design ideas for improving the review response process.
3.2. Design and development of “Smart Responder”
The study is part of an extensive Design Science Research (DSR) project (Hevner et al., Reference Hevner, March, Park and Ram2004), which includes six iterative steps, starting with problem identification, followed by design and development, and ending with communication. In carrying out our study, we largely follow Peffers’ DSR methodology (Peffers et al., Reference Peffers, Tuunanen, Rothenberger and Chatterjee2007). The overarching goal of the project is to integrate AI into CFM within the hospitality industry, while also contributing to a deeper understanding of human–AI collaboration in co-writing scenarios.
In the section “Problem identification and solution objectives,” we detail the problems identified and propose solutions that were subsequently implemented. The evaluation of these solutions was carried out in two iterations. In the first iteration, we ran an experiment with human authors immediately after the development of the solution, focusing on evaluating the process as well as the effectiveness and usability of the developed solutions. In the second iteration, we analyzed usage data of the new solution over a 1-year period to evaluate its long-term impact. These evaluation steps are described in detail in the following sections.
3.3. Evaluation first iteration
The evaluation step is crucial in the DSR approach to assess the effectiveness of our system and its support of the authors during the response process. Following the Framework for Evaluation in Design Science (FEDS) (Venable et al., Reference Venable, Pries-Heje and Baskerville2016), we selected an evaluation strategy and organized a two-step evaluation process. The first phase involved professional authors from our project partner FMP, who integrated the evaluation into their normal work tasks. The authors used the “Smart Responder” tool to generate responses to real-life reviews in German and English. Due to the COVID-19 pandemic, the number of active professional authors was limited (n = 4, referred to with IDs P0–P3). In the second phase, we invited master-level students (n = 14, referred to with IDs A/B/C0–A/B/C3) of a computer science seminar, “The Future of Work,” to participate in a role-playing simulation, where they formed test companies similar to FMP. Each test company consisted of 3–4 authors and one administrator. The authors of these test companies responded to assigned reviews under the supervision of administrators, who distributed the reviews and controlled the quality of the final responses. The evaluation lasted 3 days for each group, with a set number of representative reviews (n = 164) sampled from the FMP database based on language (German or English), domain (hotels and restaurants), and star rankings.
After the evaluation, all authors completed an online survey. Each questionnaire covered topics related to the objectives of the solution (system usability, support provided by the new system tools) and the self-perception of the authors while working with it (well-being, efficacy (Salanova et al., Reference Salanova, Llorens Gumbau, Cifre, Martínez and Schaufeli2003)). The questionnaires were conducted in English. Additionally, we gathered data on AI support in the “Smart Responder” through remote personal interviews. The interviews were transcribed, and a bottom-up coding approach was used to identify themes and address the issues raised by the respondents (Saldana, Reference Saldana2021). One researcher coded the data, while two senior researchers provided supervision, ensuring coding quality and consistency through iterative checks and discussions of edge cases.
3.4. Evaluation second iteration
To gain a clearer understanding of the impact of AI integration, we analyzed data from the year in which the new AI tools were actively used. The second evaluation iteration, conducted 1 year later, focused on how AI integration had influenced the roles of the various participants in the review-response process. The project partner provided a comprehensive dataset that covered the period from August 1, 2023, to July 31, 2024, consisting of 21,383 unique review–response pairs in both German (69%) and English (31%). The reviews were sourced from multiple platforms, including Google, Facebook, and TripAdvisor, and we focused solely on reviews where the final response had already been published online, ensuring that each response had gone through the entire review-edit-publish cycle.
Each review-response pair in the dataset offered valuable insights into the collaboration between various actors in the review-response process. By analyzing the differences between the initial drafts and the final published versions, our goal was to determine the role that each actor played, whether human authors, AI systems, or business representatives, in shaping the final response. This analysis also provided insights into the extent of participation by human actors in the evolving loop of review–response formulation, especially in cases where multiple contributors were involved. To quantify the degree of similarity or variation between the initial and final versions, we applied the Levenshtein distance and Jaccard similarity metrics, well-established algorithms for measuring differences between two texts.
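For readers unfamiliar with these metrics, the following plain-Python sketch illustrates how the two measures can be computed for a draft and its published version; it is an illustration under simplified assumptions, not the exact implementation used in our analysis.

```python
# Illustration of the two similarity measures used to compare draft and final responses.

def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete, substitute) turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def jaccard_similarity(a: str, b: str) -> float:
    """Word-level overlap between two texts (1.0 means identical word sets)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b) if set_a | set_b else 1.0

draft = "Thank you for your kind review, we hope to see you again soon."
final = "Thank you for your kind review! We hope to welcome you again soon."
print(levenshtein_distance(draft, final), round(jaccard_similarity(draft, final), 2))
```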
4. Design and evaluation first iteration
4.1. Problem identification and solution objectives
The following scenario describes the role of FMP and the process addressed by the project. Imagine a medium-sized pizzeria, “Roma.” Like most businesses in the hospitality sector, it already has profiles on TripAdvisor and Yelp. Aware of the importance of guest feedback and wanting to improve its online image, the pizzeria decides to outsource CFM to the external CFM provider FMP. FMP employs professional authors familiar with the process of responding to reviews. Every day, FMP first retrieves the online reviews of “Roma” with the help of its own software solution and APIs. After collecting the reviews, the FMP manager (Administrator) identifies which reviews require a response based on the pizzeria’s specifications and assigns each of them to a response author based on availability, skills, experience, and review difficulty. The review author (for instance, Peter) receives the assigned tasks and has 24 hours to formulate response suggestions. Replying to the online reviews presents a particular challenge for Peter. Like all professional authors, Peter first reads and analyzes the unanswered review and identifies the issues (the criticism or praise) mentioned in the text. Since Peter does not know the pizzeria, he uses the FMP application to find out more about the business (opening hours, location, and comments from the pizzeria about temporary offers, events, possible problems, etc.) and collects additional data. Afterward, Peter formulates a possible response. Naturally, this response must be personalized and meet the requirements and specifications of the pizzeria “Roma.” The administrator validates the submitted response. If the quality of the response is good, the administrator forwards it to the pizzeria “Roma.” In the case of an insufficient response, the answer may be edited directly by the administrator or sent back to the author for revision. However, the final decision about publication rests with the pizzeria “Roma.” The process is depicted in Figure 2 and results in high-quality, manually formulated responses that satisfy FMP clients.

Figure 2. The current responding process at FMP.
The described process confronts FMP and its employees with considerable challenges. Professional authors, including Peter, need to meet all quality requirements and write the best possible, up-to-date responses. The existing FMP application does not provide any automation, and professional authors handle all steps manually, which takes a great deal of time and makes the process expensive. Finally, FMP employees must provide the answer within a limited time frame. This requirement creates a logistical and qualitative challenge for FMP, whose staff struggles to meet the demand in an acceptable time. This study aims to establish a human–AI collaboration system that leverages the strengths and skills of both actors to improve CFM. The goal is to create an intelligent system that effectively utilizes multiple inputs (reviews, venue information, venue comments) to generate high-quality review responses while optimizing the resources of professional writers.
During the first iteration, discussions with the CEO and administrators of the FMP helped us identify key challenges and define the corresponding solution objectives. These objectives include improving the quality of information, increasing process efficiency, enhancing transparency, and defining responsibility for quality. Table 1 outlines the challenges alongside the respective solution objectives and design ideas aimed at achieving these goals.
Table 1. Problems in CFM and corresponding solution objectives with design ideas

The existing FMP software enabled partially structured storage of venue data in a “notebook” format. However, the new solution aims to enhance the quality of information by shifting the focus toward improving the review–response writing process. The core objective is to make the writing process more efficient by providing authors with support during response formulation and ensuring the quality of the final text. According to control theory (Doyle et al., Reference Doyle, Francis and Tannenbaum2013), effective automation requires the system to address both observability and controllability. Since the goal is to produce high-quality responses, the quality of the responses serves as the key PV that must be monitored and controlled. To achieve this, the resulting system clarifies quality responsibility and introduces a measurable PV that can be regulated. To maintain high standards for responses published online, we implemented a two-step quality control mechanism. This mechanism involves both FMP and the hospitality company, ensuring that each response meets the desired quality. Additionally, the hospitality company will oversee direct online communication with guests, ensuring that their engagement aligns with brand standards and customer service expectations.
4.2. Design and development of “Smart Responder”
In the “Smart Responses” project, we created the “Smart Responder” system to tackle the identified issues and achieve the solution objectives. This AI-based multiple-input–single-output (MISO) control system is designed to enhance the response process by effectively combining human author skills with AI capabilities. Figure 3 illustrates the innovative approach of using advanced NLP technologies to organize the response process of the FMP.

Figure 3. The novel AI-assisted responding process.
In the first step, the Administrator assigns the reviews to authors based on their experience and availability. An intelligent sentiment and content analysis tool supports review analysis. After this, a smart response generator preformulates the review responses. The “Smart Responder” utilizes the stored venue information to finalize the answer text. The last AI-supported step estimates the quality of the final review response. During the subsequent proofreading phase, the professional writer uses detailed quality information.
Application redesign and enhanced use of venue information. “Smart Responder” supports the SO “Information Quality” and changes the design of the existing software solution. It categorizes and structures venue information (food, venue, ambiance, service, etc.), providing authors with access to company details and the ability to add comments. To handle the volume of data, a mechanism is implemented to present authors with new information about the company they have not seen before (in a pop-up window). This information builds on the historical data: whether the author has already written a response for this location and whether there have been any new contributions to this venue since the last time. Moreover, the “Smart Responder” utilizes review content to filter and prioritize relevant venue information. To evaluate how well the solution achieves the objective, we measure the usability score of the entire system (Bangor et al., Reference Bangor, Kortum and Miller2008).
Sentiment and content analysis. To address the tagging efficiency problem, the “Smart Responder” improves the review tagging process. It includes a customized sentiment and content analysis engine that automatically identifies and classifies aspects in guest-written online reviews that need to be addressed in an authored response. The solution extracts and analyzes text spans related to fine-grained aspects such as food, service, and facilities. Sentiment analysis is performed on these spans using state-of-the-art text classification techniques. The engine is trained on annotated data manually labeled by FMP.
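The sketch below illustrates the general idea of such aspect-oriented sentiment tagging; the keyword lists and the generic pretrained classifier are stand-ins for FMP’s customized engine, which is trained on manually labeled data.

```python
# Illustrative aspect-oriented sentiment tagging for a review.
# Aspect keywords and the generic pretrained model are assumptions for illustration.
from transformers import pipeline

classify = pipeline("sentiment-analysis")  # generic pretrained sentiment classifier

ASPECT_KEYWORDS = {
    "food": ["pizza", "food", "dish", "menu"],
    "service": ["service", "waiter", "staff"],
    "ambiance": ["ambiance", "music", "atmosphere"],
}

def tag_review(review: str) -> list:
    """Split the review into sentences and tag each span with its aspects and sentiment."""
    tags = []
    for span in review.split("."):
        span = span.strip()
        if not span:
            continue
        aspects = [a for a, kws in ASPECT_KEYWORDS.items()
                   if any(kw in span.lower() for kw in kws)]
        if aspects:
            sentiment = classify(span)[0]  # e.g., {"label": "NEGATIVE", "score": 0.99}
            tags.append({"span": span, "aspects": aspects, "sentiment": sentiment["label"]})
    return tags

print(tag_review("The pizza was delicious. Sadly, the waiter ignored us for 20 minutes."))
```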
Difficulty score for the reviews. “Smart Responder” supports the authors’ work by estimating how difficult it is to answer the review. For the difficulty score calculation, we use the following parameters: meta-information such as review length, rating, language, platform, type of venue, and the results of sentiment analysis and content analysis of the review text. This score uses FMP historical data and predicts the difficulty of a review on a scale from 1 (easy) to 3 (difficult). The rating is displayed as part of the evaluation information and is easily identifiable by traffic light color coding.
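A hypothetical sketch of such a difficulty predictor is shown below; the feature set, the toy training data, and the choice of a gradient-boosting classifier are illustrative assumptions rather than the model trained on FMP’s historical data.

```python
# Hypothetical difficulty-score predictor (1 = easy, 3 = difficult).
# Features, toy data, and model choice are illustrative assumptions.
from sklearn.ensemble import GradientBoostingClassifier

# Each row: [review length, star rating, negative aspects found, positive aspects found]
X_train = [
    [40, 5, 0, 3],
    [250, 2, 4, 0],
    [120, 3, 2, 1],
    [60, 4, 0, 2],
    [300, 1, 5, 0],
    [90, 3, 1, 1],
]
y_train = [1, 3, 2, 1, 3, 2]  # difficulty labels derived from historical data

model = GradientBoostingClassifier().fit(X_train, y_train)

new_review = [[180, 2, 3, 1]]
print(model.predict(new_review))  # e.g., [3] -> displayed as a red traffic light
```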
Quality score for responses and quality checker. To address the authors’ need for feedback and improve response quality transparency, the “Smart Responder” incorporates a quality assessment mechanism. This mechanism introduces a PV, the quality score, to evaluate the responses. The quality score is calculated based on quality characteristics that were specified and formalized in a workshop with FMP administrators and then prioritized and weighted in an iterative process. The characteristics comprise five main categories (grammar, language, length/structure, content, and greetings/farewells) and 22 individual characteristics (e.g., language errors, incorrect punctuation, structure, and responsiveness to negatives/positives). The Quality Checker, acting as the feedback control system’s sensor (Doyle et al., Reference Doyle, Francis and Tannenbaum2013), calculates the quality score based on the review, business information, answer, and sentiment analysis results. The score, ranging from 0 to 100, provides authors with feedback on the answer quality.
For automation purposes, the control system requires a threshold (SP). FMP managers established target values for each quality category, resulting in a final target quality score of 85. The Quality Checker compares the output quality score with this threshold value to determine whether a response can be submitted. Initially, the quality checker presents its results (per category) in a concise format. Whether the quality is sufficient or not, users can access a detailed overview of errors and suggestions for improvement for each quality characteristic.
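The following sketch illustrates how the Quality Checker can act as the sensor of the control loop: per-category scores are aggregated into an overall quality score (0–100) and compared against the setpoint of 85. The category weights and example scores are illustrative assumptions, not FMP’s actual weighting.

```python
# Illustrative quality-score aggregation and setpoint comparison.
# Category weights and example scores are assumptions for illustration.
CATEGORY_WEIGHTS = {
    "grammar": 0.25,
    "language": 0.20,
    "length_structure": 0.20,
    "content": 0.25,
    "greetings_farewells": 0.10,
}
SETPOINT = 85  # target quality score defined by FMP managers

def quality_score(category_scores: dict) -> float:
    """Weighted aggregation of per-category scores (each on a 0-100 scale)."""
    return sum(CATEGORY_WEIGHTS[c] * s for c, s in category_scores.items())

def can_submit(category_scores: dict) -> bool:
    """Controller decision: the response may be submitted once PV >= SP."""
    return quality_score(category_scores) >= SETPOINT

scores = {"grammar": 95, "language": 88, "length_structure": 80,
          "content": 82, "greetings_farewells": 100}
print(quality_score(scores), can_submit(scores))  # 87.85 True
```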
Response generator. The tremendous potential of the “Smart Responder” in achieving the SO “Process efficiency” lies in supporting authors in the formulation of response texts. The system integrates an intelligent Response Generator that treats the task as a sequence-to-sequence (seq2seq) modeling problem. The Response Generator uses BART, a denoising sequence-to-sequence autoencoder, and a dataset of about 8,000 review–response pairs assembled from responses written by FMP authors. The proposed generator tool is data driven and includes three main processing steps: preprocessing, generation, and post-processing (Katsiuba et al., Reference Katsiuba, Kew, Dolata and Schwabe2022). The “Smart Responder” user sees the results as soon as the assigned review appears and can apply, edit, or completely overwrite the pre-written text.
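As a minimal illustration of the seq2seq setup, the following sketch runs inference with a publicly available BART checkpoint; in the project, such a model would be fine-tuned on the review–response pairs before producing useful drafts.

```python
# Minimal BART seq2seq inference sketch; "facebook/bart-base" is a public checkpoint,
# not the fine-tuned generator used by FMP.
from transformers import BartForConditionalGeneration, BartTokenizer

checkpoint = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)

review = "Great pizza, but we waited 40 minutes for a table."
inputs = tokenizer(review, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_length=80, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```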
“Smart Responder” provides FMP employees with a digital workspace that consolidates all described components into a single, well-organized interface (see Figure 4).

Figure 4. Implemented “Smart Responder”: Author interface.
4.3. Evaluation
The implemented artifact seems to be well accepted by both subject groups. Professional authors highly value the application for its modern design, new functionality, and user-friendliness. They appreciate the improved work structure and the convenience of not having to scroll extensively to find the necessary information. The author P2 finds the new system easy to get into and “quite self-explanatory […] It was very […] easy to get started” (P2). The new system is described as easy to understand and navigate, aiding their workflow. The introduction of venue and author comments allows for more information to be incorporated. The AI tools, particularly the quality checker, are well-regarded (P0, P1). These positive impressions align with the measured system usability score (SUS G = 85). On the other hand, the new authors seem to have high satisfaction with the app but lower than the professional authors, which is reflected in the lower SUS score (SUS G = 67). They particularly appreciate the application’s structure and find the available information helpful in crafting responses, confirming the fulfillment of SO “Information Quality.” The authors explicitly emphasize that they rely on and use the information available in the system. “I really appreciated it, especially when the clients mentioned some specific things, then I could check the comments. Sometimes I could use some information from there to give advice for the next visit” (B4). As a result, almost none of them search for information about customers online. Moreover, the authors mention that filling and updating such information is a potential field for additional AI-driven tools (P1).
Opinions on the Sentiment and Content Analysis tool vary among subjects. Professional authors generally find it useful, with only a small percentage of cases producing incorrect output: “only 10% of the cases were a wrong output and the rest are properly categorized” (P1). Nevertheless, they believe that, in general, authors do not need the sentiment and content analysis “to get the job done” since they need to read a review themselves (P0). Another author appreciates the tool’s ability to highlight positive and negative phrases, allowing for a quick understanding of the review. For example, when the complete review is marked in green with single red spots, the author focuses on writing a positive answer and only looks more closely at the objections and addresses those negative points (P2). New authors, on the other hand, need time to familiarize themselves with the content categories and tagging concepts, and the different colors used in the tool initially confused them. Overall, both tools received limited acknowledgment from the authors, with less than 50% feeling supported by the content and sentiment analysis outputs (see Figure 5).

Figure 5. Perceived support of “Smart Responder” components.
The Quality Score, along with the Response Generator, has a significant impact on author practices. The authors value the feedback provided by the quality checker and appreciate having direct insight into their performance. Professional authors, who are already familiar with the process, welcome the introduction of a quantified value to represent quality. For novice authors, the quality score provides guidance and helps them identify mistakes, although they still have the autonomy to determine what constitutes a good response.
The authors identified some weaknesses in the tool implementation that could impact authoring practices. One concern is the tool’s limited handling of multiple languages within a single response. Specifically, author P2 notes that the tool is not trained to treat reviews in Swiss German (which many customers in Switzerland expect) or company-specific spellings (e.g., all lowercase letters in a particular sentence) as valid, which can lead to false grammar-error marks. The authors highlight the importance of allowing corrections and exclusions of such language-specific “mistakes” during the calculation of the Quality Score. However, the current version of the tool lacks a feedback option to address these issues, such as “it is not a mistake” or “ignore this mistake,” which the authors suggest implementing in future versions (P0). The tool also has a weakness in its motivating role for authors. Although it directly impacts their performance, the quality checker only highlights errors and lacks positive reinforcement. The tool’s approach of “no news is good news,” without providing praise or acknowledgment, is seen as a fundamental flaw that needs adjustment, according to author P2. Furthermore, the logic applied by the quality checker is not transparent (“it is not quite transparent”—P2) and can be frustrating and demotivating for authors, especially when its results are not immediately understandable. Despite these weaknesses, 80% of the authors still found the quality estimation tool helpful.
The Response Generator is highly appreciated by the authors, as it provides them with an initial version of the response. The quality of the generated texts varies depending on the difficulty of the review. The easiest reviews are positive reviews with a five-star rating, while the most difficult are negative or neutral reviews with a 2–3 star rating (P0).
Opinions about the response generator’s performance could hardly be more diverse. It is generally effective for simple reviews from existing customers but falls short in providing personalized responses for more difficult reviews. “Unfortunately, an auto-generated response is often a burden as it looks good and correct, but it is not” (P0). Therefore, responses to challenging reviews can rarely be used in their entirety. The automatically created text often contains content-related mistakes (for example, the wrong name of the restaurant/hotel or the wrong location in the response’s body) or even spelling errors. It is the main body of the response that is most often faulty. On the positive side, the machine-generated text provides a structure, and the salutation and the closing greeting are almost always correct.
Despite the errors and problems with generating customized text, the authors answered the question, “How is the overall quality of the generated responses? Where 1—very poor and 10—excellent,” with an average value of mean = 7.07 (SD = 1.33). They see even inaccurate texts as facilitating routine work. The response generator is “quite an imperfect tool, which nonetheless saves time despite its faults” (P2). It is easier to correct an ambiguous response than to write one from scratch. “It is clearly much more difficult if you do not have anything if you have to write everything yourself from the beginning […] It also takes more time than when you correct adaptively. Adjusting is always easier than writing everything yourself” (P2). While the professional authors are mostly skeptical of the generated responses, the newcomer authors rely heavily on the AI-provided structure and only make small changes themselves. “My role was kind of the teacher of the AI. I just double-checked what it wrote, corrected it, and then sent it. The role of the AI was to prepare everything so that I only had to check it” (C3). The danger of the current version of the generator is that automatically generated responses may be “full of bad content written correctly” (P0). Therefore, the authors indicate a need for a self-assessment capability in the AI generator.
The Difficulty Score is the feature that the authors considered to have the least impact on their work practices, although a few used it as a rough guide to determine how much time to allocate for a response. “It helps you to get a general idea of how difficult this review is going to be […] I just noticed what the difficulty score is if it was very hard, and I knew I am gonna invest some more time here” (B1). Professional authors, with their experience, do not find the score relevant as they can handle any level of difficulty efficiently. The score may not be relevant for the authors, but it can benefit further process engineering.
5. Design and evaluation second iteration
5.1. Design and development
Over the past year, the performance of large language models (LLMs) has significantly improved. The advances in language models have transformed the business model of FMP. One year after the introduction of the new solution, FMP offers several options for composing responses to customer reviews: (1) Standard texts for simple reviews: For straightforward reviews, hotels and restaurants can define predefined standard texts that are published under their name. (2) Human-generated responses: For more complex reviews, responses are initially written by FMP authors and may be edited by business representatives (e.g. hotel or restaurant managers) before being published. (3) AI-generated responses: Another option for complex reviews involves responses that are generated by the FMP’s AI-based response generator. These are then revised and published by business managers.
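A hypothetical routing sketch of these three options is shown below; the mapping from difficulty to channel and the function name are assumptions for illustration and do not reflect FMP’s production logic.

```python
# Hypothetical routing of incoming reviews to the three response options.
def route_review(difficulty: int, has_standard_text: bool) -> str:
    """Return the response channel for a review (difficulty: 1 = easy .. 3 = difficult)."""
    if difficulty == 1 and has_standard_text:
        return "standard_text"   # option 1: predefined text published under the venue's name
    if difficulty == 3:
        return "human_author"    # option 2: FMP author writes, business manager may edit
    return "ai_generator"        # option 3: LLM draft, revised and published by the manager

print(route_review(difficulty=2, has_standard_text=False))  # -> "ai_generator"
```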
These changes have impacted the solution objectives, particularly in the areas of Quality Responsibility and Transparency. While most objectives remain relevant, the improved quality of LLM outputs has reduced the need for professional writers for less demanding reviews; in such cases, there is now only a single-step quality control. As a result, the quality control process, which was previously outsourced to FMP, can now be performed internally by the hospitality businesses. In the updated process (see Figure 3), this control function is completely taken over by the hospitality company. However, this streamlined approach only applies to outputs from specially trained LLMs and not to all randomly generated responses. The responsibility for guest communication remains in the hands of the hospitality businesses.
In the initial phase of the project, we discussed the issue of authorship transparency with FMP managers. However, due to the limitations of review platforms (which do not allow authorship identification) and the shift to external quality control by hospitality companies, the proposed transparency measures were not implemented. With the development of LLMs, the concept of authorship has evolved, making it increasingly difficult to distinguish between human-written and AI-generated responses. As AI-generated content becomes more prevalent, it is increasingly important to ensure transparency about what has been written by humans versus AI. Understanding the roles of those involved in the writing process, as well as the proportion of human versus AI-generated content, is essential for addressing transparency in human–AI collaboration in CFM.
5.2. Evaluation
To assess the contributions made by different response authors in the formulation of the final response text, we analyzed the number of modifications applied to both human-written and LLM-generated outputs. For this purpose, we used two text similarity metrics. Since there is no single metric that can fully capture all aspects of text similarity, we applied two distinct approaches: single-character differences (Levenshtein distance) and word-based differences (Jaccard similarity). A Levenshtein distance close to zero or a Jaccard (or cosine) similarity near one would indicate that the final response text has changed very little from the original.
We analyzed a total of 15,098 human-authored responses and 2,560 AI-generated responses, examining the roles played by the original author and the client manager in the revision process (see Table 2). The table presents an intriguing comparison between human-authored and AI-generated responses, focusing on the modifications made by company managers prior to publication. AI-generated responses were more frequently altered: 61% were revised before publication, compared to just 15% of human-written responses. This indicates that while AI-generated responses often require some refinement, they still provide a useful first draft for company managers.
Table 2. Adjustments to the pre-formulated review responses, done by the company managers

In terms of the extent of modification, human-written responses typically required larger absolute adjustments: the mean Levenshtein distance for human-written responses was 75.22 characters, compared to 63.29 characters for AI-generated responses. Relative to the overall content, however, AI-generated responses underwent a greater proportion of alterations (15% versus 10% for human-written responses). Despite these differences, both categories of responses retained a considerable degree of their original content, as evidenced by the high Jaccard similarity scores (0.82 for human responses and 0.80 for AI). This illustrates that although managers were more inclined to modify AI-generated responses, the final versions of both AI and human responses were not notably different from their initial drafts.
6. Discussion
This study’s scientific contribution extends beyond NLP-based CFM applications. It innovatively applies control theory to the design of collaborative human–AI systems, with the derived solution objectives providing a guide for analogous collaborative settings.
6.1. Power of language automation in CFM
Current NLP solutions offer opportunities for automating the writing process and facilitating human–AI collaboration. Control theory (Doyle et al., Reference Doyle, Francis and Tannenbaum2013) provides a framework for creating a control loop without human actors and thus automating the process. However, achieving the level of text quality required to fully automate the response to reviews is still a challenge. The contextualization and connection of the generated text to the source review are essential factors, and maintaining high-quality standards necessitates the involvement of human authors. Human authors play a vital role in considering review information, addressing it within company guidelines, expressing empathy, personalizing responses, and ensuring the correctness of the final text.
NLP models learn from existing data, and they perform well in generating responses for simple, generic reviews (Katsiuba et al., Reference Katsiuba, Kew, Dolata and Schwabe2022). These responses can be further enhanced by incorporating additional marketing and venue information. The generator can thus handle tedious, time-consuming reviews effectively. However, for more sophisticated reviews that contain specific content and complaints, NLP technologies reach their limits. While they can recognize content categories and provide acknowledgments (“Thank you for your nice words about our cuisine”), they cannot generate appropriate reactions. These cases require human intervention, as human authors find them more challenging but also enjoyable to respond to.
Not all components of our artifact provide equal support and effectiveness to the process. The difficulty score, for instance, does not directly benefit authors but could enhance the overall process if used beforehand. It could filter reviews for professional authors, reducing costs. Sorting reviews in this way allows for different author groups, with experienced ones handling more complex content and newcomers gaining experience and learning.
NLP technologies can handle large volumes of unstructured text but are limited in achieving the same qualitative level as human writers. By combining these two actors, we are forming a new system that can accomplish the task, provide results, and qualitatively improve the performance. The emerging human–AI collaboration transforms the familiar roles and tasks of the actors (Dellermann et al., Reference Dellermann, Ebel, Söllner and Leimeister2019), raising the question of the organization of the emerging system and the distribution of roles and tasks within it.
Implementing human–AI collaboration in the “Smart Responses” project has resulted in a significant shift in the role of the human author. Professional authors transition from content creators to proofreaders, while newcomers act more as controllers. Proofreaders use the generated output and the available data to create better text, whereas controllers focus on correcting the existing text; the difference lies in the amount of new content produced. We divided tasks between groups of human authors according to their difficulty. While professional writers could focus on complex reviews, novice writers could control the quality of responses to more straightforward reviews.
Understanding and enhancing human–machine collaboration requires examining the relationship between actors, their perceptions, expectations, and changes in roles and practices. In human–AI collaborations, we observe expectations placed on AI. The authors use the implemented solutions as tools but partly evaluate their results as those of a hypothetical team member, sometimes expecting the models to be as easily trainable as human employees. However, training NLP tools is a time-consuming process, which can lead to frustration and demotivation when expectations are not met. On the other hand, human authors have the ability to adapt and learn faster than intelligent systems. For instance, authors may adjust to an inadequate quality assessment to improve their quality score, even intentionally incorporating false data to prevent the system from decreasing the score. This demonstrates the authors’ over-adaptability to optimize results.
Human authors often have high expectations for AI tools in our project, so they are particularly sensitive when these tools fail, especially if it directly impacts the final results. The response generator and quality check are the most critical components that elicit strong reactions from authors when they encounter errors. In contrast, sentiment and content analysis, which provide input data for other tools, do not generate the same level of disappointment. Another interesting observation is the initial trust authors place in the technology that produces the data, leading to overconfidence. Authors believe everything is under control and may not question the accuracy of the generated response until they notice significant discrepancies.
Surprisingly, the authors expressed mixed opinions about the quality of the solution. Despite the generator’s imperfections, they were satisfied with having text to work with. This underscores that NLP solutions do not need perfection; even flawed models can significantly improve the process and reduce authors’ workload.
6.2. Human–AI collaboration
The study’s findings have implications for designing human–AI teams. They challenge the traditional view of humans as dominant decision-makers and machines as subordinates (Brynjolfsson and McAfee, Reference Brynjolfsson and McAfee2017). The understanding of the CFM in terms of control theory (Doyle et al., Reference Doyle, Francis and Tannenbaum2013) indicates that humans and machines can take on tasks of different components. The research suggests a mutual exchange of control rights between human and AI agents, resulting in a more balanced distribution of responsibilities and recognition of each actor’s specific skills (Dellermann et al., Reference Dellermann, Ebel, Söllner and Leimeister2019). We claim that researchers need to explore different configurations of sensing and control to establish effective collaborations between humans and machines.
Additionally, the results reveal mutual adaptations between machines and humans in the human–machine team. Specifically, they point to several aspects that are subject to adaptation. Whereas we expected that authors’ practices would evolve due to the inclusion of AI, this research provides some insights into what makes the authors adapt their practices over time. They not only explore the tools’ abilities but also build models about the reliability and trustworthiness of the tools, which in turn changes their perspective on their own role in this configuration. This makes the human–machine team a highly malleable, dynamic system. Consequently, the human and machine agents will develop new skills over time (e.g., in our case, the authors improved their ability to work as proofreaders rather than as composers of text). This implies that researchers and designers should not only think in terms of complementary skills when designing hybrid intelligence systems (Dellermann et al., Reference Dellermann, Ebel, Söllner and Leimeister2019; Seeber et al., Reference Seeber, Bittner, Briggs, DeVreede, DeVreede, Elkins, Maier, Merz, Oeste-Reiß and Randrup2020) but also try to envision possible adaptations to the skillset of the participating agents. Rather than designing for current skills, they should design for future skills. For instance, they can include components to improve the desired skills.
The fact that human skillsets can adapt much faster and more flexibly than the skills of AI poses a great challenge in this regard, especially given that state-of-the-art AI is limited to predefined, specifiable tasks with clear targets and reward schemas. To allow the human–machine team to explore and settle on the optimal distribution of skills and responsibilities, researchers need to provide more flexible and dynamic adaptations on the side of AI. If the pace of change and mutual adaptation is not synchronized, the system imposes significant challenges on humans. If the AI component changes its capabilities after the human agents have settled on a set of practices and responsibilities that they find complementary to what the AI can do, they will remain in constant search of the right procedures and practices without knowing whether the AI agent will overturn them shortly afterward. Only if exploration happens at the same pace can both agents achieve a productive configuration over time.
6.3. CFM policy recommendations
This study proposes a process for the collaboration between humans and AI in the field of customer management that not only emphasizes the significant prospects for productive collaboration on a range of tasks but also identifies potential challenges and issues that may affect the CFM field as a whole. It is therefore important to consider these issues, as they may require policy adjustments not only in CFM but also in other fields. The advent of smart tools is transforming the manner in which organizations engage with their customers, engendering notable efficiencies while simultaneously presenting a range of challenges. In light of recent technological advancements, it is imperative to examine the implications these have for customers, hospitality businesses, and external CFM providers. In the following, we discuss these aspects and propose a catalogue of recommendations for the use of AI in CFM.
The deployment of cutting-edge NLP technologies enables organizations to distinguish between customer reviews that are more straightforward to process and those that are more complex. The advent of automated systems has enabled the analysis of the content of reviews, allowing the identification of mood, tone of voice, and linguistic complexity. This has the potential to facilitate more efficient prioritization and categorization of reviews. From a policy perspective, it raises the question of how organizations can categorize feedback and whether some reviews are unintentionally receiving less attention due to their perceived difficulty. The establishment of a transparent process is essential to ensure that all reviews, regardless of their complexity, are treated with the same level of care:
Information quality: CFM should provide complete, up-to-date, relevant, and personalized information when responding to customer reviews. This guarantees that all reviews receive consistent and high-quality responses that accurately address the customer’s feedback. All customer reviews, regardless of their complexity, content, or sentiment, should be treated with the same care and professionalism. Every review, whether positive or negative, simple or detailed, deserves a thoughtful, respectful, and consistent response. This approach ensures fairness, reinforces the integrity of the brand, and shows all customers that their feedback is valued equally.
In the original FMP system, information quality relied on collaboration between various actors (administrators and authors) and the system itself. Stakeholders were responsible for studying the available data and staying informed about any changes in venue information. During the first iteration, we improved this process by structuring the available information. With the integration of NLP technology, relevant information can now be prioritized and presented to authors, who receive regular updates. This enhancement ensures that responses, in line with this CFM policy recommendation, are more relevant and based on up-to-date information.
Process efficiency: The process should produce responses in a timely and efficient manner. Efficiency concerns not only the text production process itself but also the provision of input material in a way that enables efficient processing. Furthermore, authors should be provided with information that allows them to control their effort, e.g., information on the effort required given the complexity of the customer feedback.
Initially, process efficiency at FMP was achieved through effective organization. Challenges such as tagging, personalizing responses, and ensuring quality were addressed in the first iteration by automating review content analysis, preformulating responses, and offering quality improvement recommendations. With recent advancements in language processing, the system now generates responses and analyzes text content and sentiment, further enhancing the efficiency of the process. Authors can now save time by focusing on key aspects of a review, allowing them to craft more thoughtful and higher-quality responses. The second evaluation iteration reveals that AI is now actively used in the process, and its output often forms the basis for the final response. This shift emphasizes the importance of transparency in AI involvement.
Transparency: Customer feedback management should be internally transparent about the quality of responses and the performance of authors. Both internally and externally, organizations should carefully evaluate how transparent they are about the participation and contribution of human and AI agents. Companies may want to clearly disclose when AI systems are used to analyze, generate, edit, or control the quality of responses to customer feedback. They may also want to specify whether responses were fully written by humans, fully generated by AI, or to indicate the extent of each contribution.
At the beginning of this study, transparency was built into the process but constrained by the limitations of online platforms. Administrators were unable to provide feedback to authors on their performance. We addressed this gap in the first iteration by introducing a quality scoring system. Recent developments in language processing have gone further, enabling real-time feedback on each response written by an author. The second iteration also tackled the issue of authorship transparency, allowing for clearer attribution of human and AI contributions.
As AI becomes more involved in content creation, the issue of authorship attribution takes on greater significance. Our findings indicate that, although humans frequently monitor the final quality of responses, the majority of the content may still originate from AI systems. This lack of clarity regarding authorship may have ethical and reputational consequences for organizations. As AI continues to mature, it is crucial to develop guidelines that explicitly determine how and when AI involvement should be disclosed to customers. One potential framework could oblige companies to explicitly state when responses have been generated by AI, akin to how some platforms currently disclose when reviews have been sponsored or flagged for potential bias. This would improve transparency and foster trust between companies and their customers.
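One way such a disclosure rule could be operationalized is sketched below: a record of human and AI contributions to a response is mapped to a customer-facing label. The fields, the 50% threshold, and the label wording are assumptions made for illustration only, not a prescription.

```python
from dataclasses import dataclass


@dataclass
class AuthorshipRecord:
    ai_draft_chars: int      # characters taken over from the AI-generated draft
    human_edited_chars: int  # characters changed or added by the human author
    final_chars: int         # characters in the published response

    def disclosure_label(self) -> str:
        """Map the recorded contributions to a customer-facing disclosure label."""
        if self.ai_draft_chars == 0:
            return "Written by our team"
        human_share = self.human_edited_chars / max(self.final_chars, 1)
        if human_share >= 0.5:
            return "Written by our team with AI assistance"
        return "Generated with AI and reviewed by our team"
```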
Quality responsibility: CFM must assign clear responsibility for the quality of responses, whether internally or via third-party providers. This includes both the quality of the initial response and the quality control by other actors.
In the previous FMP system, quality responsibility was divided between authors and administrators, with the final accountability resting with the hospitality company. While this division of responsibility remains, advancements in language processing have introduced new dynamics. Response quality can now be quantified, making it possible to shift quality responsibility either to the human author or to the hospitality company manager, depending on the situation. This flexibility allows for more tailored quality assurance based on who is overseeing the review process.
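The following sketch illustrates how a quantified quality score could be used to route sign-off responsibility to the appropriate actor. The threshold values and role names are illustrative assumptions rather than the rules applied at FMP.

```python
def assign_quality_signoff(quality_score: float, ai_generated: bool) -> str:
    """Decide who signs off on a response, given a quantified quality score.

    Thresholds and roles are illustrative only.
    """
    if quality_score >= 0.9:
        # High-scoring responses can be released by the human author directly.
        return "author"
    if ai_generated:
        # Lower-scoring AI drafts are escalated to the hospitality company's manager.
        return "manager"
    # Lower-scoring human-written responses are checked by the administrator.
    return "administrator"
```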
The growing trend of outsourcing CFM to external providers has been driven by the substantial time and resources this process demands. However, the rise of AI tools capable of generating and evaluating responses is reshaping this dynamic. The need for human authorship in response creation is diminishing, and AI systems are increasingly taking over quality control tasks. This shift is significantly impacting the business models of external CFM providers, whose core services—human expertise in crafting and checking responses—can now be performed in-house with AI assistance. As companies regain control over their CFM processes, many may opt to handle customer feedback internally, supported by AI, raising questions about the future relevance and role of third-party providers. Furthermore, transparency in the authorship of responses, whether by internal staff or external providers, remains essential for maintaining trust and accountability with customers.
The influence of AI extends beyond the confines of the response process. In certain instances, artificial intelligence can be employed not only to generate responses but also to write reviews. This creates an even more complex environment, in which the authenticity of both reviews and responses can be questioned. It is imperative to establish clear guidelines on authorship and the disclosure of AI’s role in the customer feedback loop to prevent a scenario where machines simply talk to machines, that is, AI-generated reviews receiving AI-generated responses.
In light of these considerations, it is essential to differentiate between authentic, human-generated reviews/responses and those generated by AI systems. Labeling content as AI-generated or human-written can ensure that customers are dealing with genuine experiences and not fake interactions, thereby preserving the integrity of the review system.
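A minimal sketch of such labeling is shown below: both reviews and responses carry a provenance label, which also makes it possible to flag machine-to-machine exchanges for human attention. The enum values and the flagging rule are illustrative assumptions, not a tested mechanism.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    HUMAN = "human"
    AI = "ai"
    AI_ASSISTED = "ai_assisted"
    UNKNOWN = "unknown"


@dataclass
class FeedbackItem:
    text: str
    origin: Origin  # declared or inferred provenance of the text


def needs_human_attention(review: FeedbackItem, response: FeedbackItem) -> bool:
    """Flag exchanges in which an AI-generated review would receive an AI-generated
    response, so that a human is pulled into the loop."""
    return review.origin is Origin.AI and response.origin in (Origin.AI, Origin.AI_ASSISTED)
```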
7. Conclusion and future work
This research explores the use of NLP technologies in supporting the process of responding to online reviews. While full automation of the response process is not yet achievable, NLP can effectively analyze reviews and assist in formulating responses. However, improvements are needed to ensure precision and a semantic connection to the reviews, which may require intervention by human authors. Nevertheless, there is significant potential for automation and author support. Our study provides transferable design knowledge that can serve as a foundation for the development of new CFM products. We describe solution objectives and an architecture for intelligent co-writing within CFM using AI support. The findings have practical implications for CFM providers, enabling them to provide better and faster responses to guests by leveraging the collaboration between humans and AI.
The utilization of AI in CFM has given rise to a range of policy considerations that demand attention from firms, regulators, and third-party CFM providers alike. These include the clarification of authorship and transparency in AI participation, as well as the adaptation of the CFM industry’s business model. The proactive addressing of these issues will enable organizations to leverage the potential of AI while maintaining accountability and trust in their client relationships.
Like many early design research studies, this study is not without limitations. In the next steps, we want to improve the external validity of the system by increasing the number of system users and extending the language support beyond German and English. To evaluate the potential of AI and NLP in CFM, it is essential to expand the field beyond the hospitality industry. Moreover, we also plan to thoroughly analyze response quality through the involvement of random review readers and potential guests. The authors’ practices will be further validated through extensive quantitative analysis.
Future research should focus on the implementation and validation of the proposed recommendations, ensuring their applicability across different stakeholders. This includes empirical studies assessing the impact of transparency on customer trust and engagement, as well as experiments on the process efficiency of the suggested CFM process. Another important direction is the exploration of response quality and how responsibility for response quality can be effectively allocated between AI tools and human actors. By addressing these research directions, future work can build a more comprehensive understanding of AI’s role in CFM, ensuring both technological advancements and responsible AI integration.
Data availability statement
This study was conducted in collaboration with a private company, and the dataset used contains sensitive customer information. Due to the proprietary nature of the data and privacy concerns, the dataset cannot be made openly accessible to the public.
Acknowledgments
This study is a collaborative effort of re:spondelligent GmbH (referred to as Feedback Management Provider (FMP) in the text), Welante AG, the Department of Computational Linguistics, and the Department of Informatics at the University of Zurich. We thank all project members for their feedback and involvement during the development and evaluation of the solution. We also thank the anonymous participants of our study, as well as the review team for their valuable advice concerning this article.
Author contribution
Conceptualization: D.K., M.D., G.S.; Data curation: D.K.; Formal analysis: D.K.; Funding acquisition: M.D., G.S.; Investigation: D.K., M.D., G.S.; Methodology: D.K., M.D., G.S.; Project administration: D.K., M.D., G.S.; Resources: D.K., M.D., G.S.; Software: D.K., M.D.; Supervision: M.D., G.S.; Validation: D.K.; Visualization: D.K.; Writing – original draft: D.K.; Writing – review and editing: D.K., M.D., G.S.
Funding statement
This study is part of the ReAdvisor innovation project, funded by the Swiss Innovation Agency Innosuisse (project number 38943.1 IP-ICT).
Competing interest
The authors declare none.
Ethical standard
This research project was reviewed and approved by the Human Subjects Committee of the University of Zurich. All elements of the project comply with the ethical guidelines established by the committee. The research adheres to legal and ethical standards, ensuring the protection of participants’ rights and privacy throughout the study.