Impact Statement
This research applied the FPC algorithm and other prominent clustering algorithms to stock market data. The analysis extracts the stock market data of cement companies listed on the Tehran Stock Exchange over the past two years and categorizes these companies based on profitability percentage and on short-term and long-term price fluctuations. The results demonstrate that companies exhibiting profitability, stability, or loss over short-term and long-term periods are placed within their respective clusters.
1. Introduction
Financial markets play a crucial role in driving economic growth. Temporal data analysis in financial markets, such as the analysis of changes in stock values, helps investors and financial institutions guide their capital toward optimal returns. In simple terms, a transaction is any buying or selling of company shares based on their price fluctuations, aiming to achieve maximum returns. To trade intelligently, each trader requires a sound understanding of market behavior, company performance, and investment strategies.
To achieve positive returns, traders must make informed decisions regarding the stocks they trade. Aside from price and liquidity, volatility is among the most critical factors in stock selection. The term “volatility” refers to the upward and downward movements in stock prices. Higher volatility implies that a stock’s price undergoes significant changes in either direction within a short interval, while stocks with minimal price fluctuations are unlikely to rise or fall significantly on any given day. Factors such as trading volume, news and financial reports, political events, economic conditions, and investor sentiment also affect stock prices. The efficient market hypothesis (EMH) posits that current stock prices fully reflect all available information in the market at any given moment, making it impossible for individuals to consistently make successful trades based on that information. However, some argue that markets are inefficient (partly due to the psychological factors of different market participants, coupled with the market’s failure to respond immediately to newly released information) (Jensen, Reference Jensen1978). Under this view, financial variables such as stock prices, stock market index values, and prices of financial derivatives are predictable, allowing individuals to achieve returns above the market average, better than random chance, by analyzing publicly available information. Stock markets are influenced by numerous interrelated factors, including (Lo and MacKinlay, Reference Lo and MacKinlay1988)
1) Economic variables: interest rates, exchange rates, monetary growth rates, commodity prices, and overall economic conditions.
2) Industry-specific variables: industrial production growth rates and consumer price indices.
3) Company-specific variables: changes in corporate policies, earnings reports, and stock returns.
4) Psychological variables of investors: investor expectations and institutional investor choices.
5) Political variables: significant political events and their dissemination (Enke and Thawornwong, Reference Enke and Thawornwong2005; Wang et al., Reference Wang, Wang, Zhang and Guo2011).
Each of these factors interacts with the others in a highly complex manner (Yao et al., Reference Yao, Tan and Poh1999; Kang et al., Reference Kang, de Gracia and Ratti2017; Antonakakis et al., Reference Antonakakis, Cunado, Filis, Gabauer and de Gracia2023). Previous studies have also concluded that the stock market is fundamentally dynamic, nonlinear, nonstationary, nonparametric, noisy, and turbulent (Deboeck, Reference Deboeck1994; Abu-Mostafa and Atiya, Reference Abu-Mostafa and Atiya1996). Nonetheless, by refining trading strategies based on more accurate predictions of financial variables, investors hope to profit from potential market inefficiencies. Traditionally, stocks with higher volatility are perceived as riskier; however, volatility traders often seek highly fluctuating stocks in anticipation of higher returns. Analyzing stock market movements therefore remains challenging and fascinating for both investors and researchers.
This study aims to support investment decisions by analyzing returns and price fluctuations in Tehran Stock Exchange data, focusing on cement companies listed on the exchange. The analysis begins by extracting these companies’ stock market data over the past two years. These companies are then categorized based on profitability percentage and on short-term and long-term price fluctuations, using renowned clustering algorithms from this domain. Throughout this process, these algorithms are compared with the algorithm presented in our previous paper (the FPC clustering algorithm), and the clustering results are evaluated against one another using standard, recognized criteria for clustering quality. Finally, the clustering results are examined and summarized, and stocks are assessed for returns and price fluctuations based on them. The study is organized as follows:
The subsequent section addresses the studies carried out in data and stock market clustering, presenting the undertaken research in each domain. Section 3 is dedicated to the required data for examination and analysis, outlining the extraction methodology of this data within a two-year time frame (from April 2021 to April 2023). Section 4 comprises the introduction of evaluation criteria and the comparison of clustering algorithms under consideration. Additionally, Section 5 includes empirical results of clustering Tehran Stock Exchange data for cement companies and compares these results with the algorithms discussed in this study. Finally, the concluding section consists of a summary and conclusions from the study.
2. Previous studies
In this section, we examine the studies conducted in two areas: data clustering and the clustering of stock market companies. Given the study’s focus on clustering active stock companies within the cement sector, the first part presents research on clustering algorithms, and the second part presents research on clustering listed companies.
2.1. Research in the field of clustering algorithms
One of the path-based clustering methods uses the minimum spanning tree (MST), initially presented by Zahn (Reference Zahn1971). In this approach, the MST of a weighted graph is first constructed, and incompatible edges are then removed. Ideally, these incompatible edges are the longest edges in the graph. However, this assumption is often violated, and eliminating such edges does not yield suitable clusters. In recent years, extensive research has focused on introducing a metric function for identifying these incompatible edges (Wang et al., Reference Wang, Wang, Chen and Wilkes2013; Wu et al., Reference Wu, Li, Jiao, Wang and Sun2013; Pirim et al., Reference Pirim, Eksioglu and Perkins2015). Moreover, the computational complexity of the proposed function is a serious consideration, and some of these algorithms specifically address computational complexity in designing this decision tree (Jothi et al., Reference Jothi, Mohanty and Ojha2018; Wang et al., Reference Wang, Wang and Wilkes2009). Additionally, Laszlo and Mukherjee (Reference Laszlo and Mukherjee2005) defined a constraint on cluster size, treating a candidate edge as a link between clusters whose removal should create two clusters larger than a specified threshold. To mitigate sensitivity to scattered data points in clustering, Zhong et al. proposed a two-stage MST-based clustering algorithm (Zhong et al., Reference Zhong, Miao and Wang2010); however, their algorithm exhibits high complexity because it requires numerous parameters. Subsequently, Zhong et al. introduced a hierarchical clustering algorithm called split-and-merge (Zhong et al., Reference Zhong, Miao and Fränti2011), which uses the MST to guide the separation and merging process. The issue with this algorithm lies in its high time complexity; optimized computation is crucial, particularly when clustering large databases. Therefore, Wang et al. introduced an MST-inspired fast clustering algorithm that operates with a computational complexity of $ \mathcal{O}\left({n}^2\right) $ .
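For concreteness, the following is a minimal sketch of the naive Zahn-style edge-removal idea that the works above refine: build the MST of the complete Euclidean graph, delete the k − 1 longest edges (the naive notion of “incompatible” edges), and take the remaining connected components as clusters. This is an illustration only, not a reimplementation of any cited algorithm.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_cluster(X, k):
    """Naive Zahn-style MST clustering: cut the k - 1 longest MST edges
    and label each resulting connected component as one cluster."""
    dist = squareform(pdist(X))                  # full pairwise distance matrix
    mst = minimum_spanning_tree(dist).toarray()  # n - 1 tree edges
    edges = np.argwhere(mst > 0)                 # (i, j) index of each edge
    weights = mst[mst > 0]                       # weights, same row-major order
    for i, j in edges[np.argsort(weights)[::-1][: k - 1]]:
        mst[i, j] = 0.0                          # cut one of the longest edges
    _, labels = connected_components(mst, directed=False)
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(1, 0.1, (30, 2))])
print(mst_cluster(X, k=2))
```

As the text notes, the longest-edge assumption often fails in practice (e.g., with chained or noisy data), which is precisely what the cited metric functions try to correct.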
Alongside the algorithms above, which perform clustering by eliminating incompatible edges from the MST, other algorithms use the MST to compute dissimilarity between pairs of data points, such as the minimax distance proposed by Fischer and Buhmann (Reference Fischer and Buhmann2003), defined as follows:

$ D\left(i,j\right)=\underset{p\in {P}_{ij}}{\min}\;\underset{\left(a,b\right)\in p}{\max }\;d\left(a,b\right), $

where $ {P}_{ij} $ is the set of all paths connecting points i and j and $ d\left(a,b\right) $ is the distance between adjacent points a and b along a path. In other words, the minimax distance between two points is the smallest possible value, over all connecting paths, of the longest edge on the path.
However, their algorithm cannot provide an optimal value for all datasets. A significant challenge in this field is how to partition objects based on the similarity matrix, because the optimization problem underlying these types of clustering is NP-hard.
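The minimax distance itself can be computed exactly on the MST, since a path minimizing its largest edge always lies on the tree; the following sketch illustrates an $ \mathcal{O}\left({n}^2\right) $ all-pairs computation based on that standard fact (an illustration, not the cited authors’ implementation):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def minimax_distances(X):
    """All-pairs minimax distance: D[i, j] is the smallest possible value,
    over paths from i to j, of the largest edge on the path. The optimal
    path lies on the MST, so one tree traversal per source node suffices."""
    n = len(X)
    mst = minimum_spanning_tree(squareform(pdist(X))).toarray()
    adj = np.maximum(mst, mst.T)                 # symmetrize the tree
    D = np.zeros((n, n))
    for s in range(n):                           # DFS from each source node
        stack, seen = [s], {s}
        while stack:
            u = stack.pop()
            for v in np.nonzero(adj[u])[0]:
                if v not in seen:
                    seen.add(v)
                    # largest edge on the tree path s -> v
                    D[s, v] = max(D[s, u], adj[u, v])
                    stack.append(v)
    return D

X = np.random.rand(20, 2)
D = minimax_distances(X)
```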
Meanwhile, the K-means algorithm remains one of the most widely used clustering algorithms (Jain, Reference Jain2010). Its popularity stems from its efficiency, simplicity, and acceptable results in practical applications. However, K-means performs well only on compact, Gaussian-shaped clusters and struggles with elongated or nonlinearly distributed clusters (Dhillon et al., Reference Dhillon, Guan and Kulis2004). To address this issue, kernel K-means was introduced (Von Luxburg, Reference Von Luxburg2007). This method projects data into a higher-dimensional feature space defined by a nonlinear function so that the data become linearly separable. However, selecting an appropriate kernel function and its parameters for a given dataset can be challenging.
A more recent approach is spectral clustering, which clusters data using the eigenvectors of a matrix derived from the input data. Results show that spectral clustering does not face the issues that traditional clustering algorithms like K-means encounter (Von Luxburg, Reference Von Luxburg2007). Nevertheless, spectral clustering confronts the user with many adjustable choices, such as the similarity metric and its parameters, the type of graph Laplacian matrix, and the number of eigenvectors used (Von Luxburg, Reference Von Luxburg2007). Unfortunately, the success of spectral clustering depends heavily on these choices, making the algorithm difficult to use effectively.
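As an illustration of those choices in practice, the following is a minimal scikit-learn sketch; the dataset and parameter values are arbitrary, and the affinity, its gamma parameter, and the cluster count are exactly the knobs described above:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two interleaved half-moons: nonlinearly separable, hard for plain K-means.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Each argument below is one of the "adjustable choices" the user faces:
# the affinity (here an RBF kernel), its gamma parameter, and n_clusters.
labels = SpectralClustering(
    n_clusters=2, affinity="rbf", gamma=15.0, random_state=0
).fit_predict(X)
```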
Chang and Yeung proposed a combined algorithm that simultaneously utilizes path-based and spectral clustering, referred to as RPSC (Chang and Yeung, Reference Chang and Yeung2008). However, their algorithm has a very high computational complexity of $ \mathcal{O}\left({n}^3\right) $ due to the eigenvector computations involved.
On the other hand, Ester et al. introduced a density-based clustering algorithm called DBSCAN (Ester et al., Reference Ester, Kriegel, Sander and Xu1996), which can identify clusters with arbitrary shapes. This algorithm can cluster datasets in $ \mathcal{O}\left({n}^2\right) $ time. Nonetheless, one of its challenges is requiring the user to specify several input parameters. Finding suitable parameters in some datasets can be exceedingly difficult.
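A minimal scikit-learn sketch of a DBSCAN call follows; eps (the neighborhood radius) and min_samples (the density threshold) are the input parameters whose tuning the text describes as difficult, and the values below are arbitrary:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps and min_samples are the user-supplied parameters; finding suitable
# values for a given dataset is the difficulty noted above.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # -1 marks points DBSCAN treats as noise
```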
Rodriguez and Laio (Reference Rodriguez and Laio2014) introduced an algorithm named find density peaks clustering (FDPC), capable of clustering data with a time complexity of $ \mathcal{O}\left({n}^2\right) $ . This method identifies cluster centers based on point density: points with higher density than their neighbors that also lie relatively far from any point of higher density are considered candidate centers. The algorithm takes a cutoff parameter, denoted $ {d}_{\mathrm{c}} $ , as input, determines the number of clusters using a decision graph, and performs the clustering accordingly. However, FDPC may not cluster datasets with irregular distributions efficiently and can only identify clusters with a distinct center.
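The two quantities at the heart of the method can be sketched in a few lines of numpy: each point receives a local density ρ (the number of neighbors within the cutoff $ {d}_{\mathrm{c}} $ ) and a distance δ to the nearest point of higher density; points where both are large are the candidate centers. This is a simplified sketch, not the published implementation:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def density_peaks_scores(X, d_c):
    """Return (rho, delta): rho is the number of neighbors within d_c;
    delta is the distance to the nearest point of strictly higher density
    (for the densest point, its largest distance to any other point)."""
    d = squareform(pdist(X))
    rho = (d < d_c).sum(axis=1) - 1          # exclude the point itself
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]
        delta[i] = d[i, higher].min() if higher.any() else d[i].max()
    return rho, delta
```

Plotting δ against ρ gives the decision graph from which the number of clusters is read off; points isolated in the upper-right corner are the centers.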
Normalized cut spectral clustering (NCut) is one of the clustering algorithms that performs exceptionally well on certain types of clusters, including nonspherical and elongated ones (Ng et al., Reference Ng, Jordan and Weiss2002; Zang et al., Reference Zang, Jiang and Ren2017). Its accuracy relies on the affinity matrix. Most spectral clustering algorithms use a Gaussian kernel function to estimate similarity; however, determining an optimal value for the $ \unicode{x03C3} $ parameter of the Gaussian kernel can pose a challenge for users.
Among traditional MST-based algorithms, one of the fastest is FAST-MST, with a time complexity of $ \mathcal{O}\left({n}^{3/2}\;\log (n)\right) $ (Jothi et al., Reference Jothi, Mohanty and Ojha2018). However, this algorithm is susceptible to noise, as it treats an outlier, a point significantly distant from all others, as a separate cluster. This is one of the most recognized problems in MST-based clustering algorithms.
Another path-based clustering algorithm is the IPC algorithm introduced by Liu et al. (Reference Liu, Zhang, Hu, Wang, Wang and Zhao2019). Utilizing Euclidean distance for distance feature extraction among elements, this algorithm incorporates MSTs and minimax distance to derive a global optimum value with time complexity of $ \mathcal{O}\left({n}^2\right) $ .
An additional path-based clustering algorithm is global optimum path-based clustering (GOPC) (Liu and Zhang, Reference Liu and Zhang2019). This algorithm requires only one input parameter (the number of clusters) for execution. This parameter can be specified by the user or estimated by the algorithm. The time complexity of this algorithm is faster than $ \mathcal{O}\left(k\times {n}^2\right) $ , indicating a quicker execution than RPSC.
In Safari-Monjeghtapeh and Esmaeilpour (Reference Safari-Monjeghtapeh and Esmaeilpour2024), a path-based clustering algorithm called FPC has been introduced, which performs clustering using MSTs and the minimax distance and presents a hybrid metaheuristic that requires only a few iterations. The algorithm achieves a global optimum value with a computational complexity of $ \mathcal{O}\left({n}^2\right) $. FPC handles datasets of various shapes, sizes, and densities acceptably and is resilient to noise and outliers. Furthermore, FPC requires only one parameter, the number of clusters, for executing the clustering process. A comparative analysis with other prominent clustering algorithms shows that FPC delivers better, more stable, and faster performance than the alternatives in this field.
Considering the algorithms reviewed above and the clustering results presented in Safari-Monjeghtapeh and Esmaeilpour (Reference Safari-Monjeghtapeh and Esmaeilpour2024), Table 1 compares the algorithms used in this article in terms of input parameters, sensitivity to initial values, time complexity, and the ability to detect the number of clusters automatically.
Note to Table 1: FDPC determines the number of clusters using a decision graph.
2.2. The studies in the field of clustering of listed companies
Given its unpredictability, investing in the stock market is challenging. Therefore, stock fluctuation data need examination to gain insights into market trends and behaviors. Many regression analysis tools are available to guide investors in making informed trading decisions. Alongside regression techniques, classification and clustering methods are employed to identify market trends and behaviors.
Numerous studies have been conducted in recent years on stock market data analysis using clustering techniques. In most of these studies, moment-to-moment stock market data are collected, clustering is performed using various parameters, and the obtained clusters are then compared with existing standard market data. Analyzing and clustering online stock market data with data mining tools is one of the key methods in this domain. Data mining can be defined as an analytical method designed to discover patterns within data. Data mining algorithms analyze patterns and relationships among stored data values and then attempt to apply these patterns to new subsets of data (Guha et al., Reference Guha, Rastogi and Shim2001). Tasks in this area are primarily divided into two main parts:
1) Prediction, which aims to predict the values of required features based on the values of other features.
2) Description, which aims to extract patterns such as correlations, trends, clusters, paths, and anomalies that summarize hidden relationships within the data. Factors used in this context include the price/earnings (P/E) ratio and earnings per share (EPS) (Setty et al., Reference Setty, Rangaswamy and Suresh2010).
In previous studies, researchers have compared the performance of various machine learning algorithms trained on trade-related information, aiming to identify the relevant stock segments in which decision support for stock analysis and suitable presentation of transactional data are possible (Suganthi and Kamalakannan, Reference Suganthi and Kamalakannan2015).
Clustering aims to group data objects with similar characteristics into clusters. Joseph and Indratmo (Reference Joseph and Indratmo2013) clustered stock data based on similar price movement patterns using the self-organizing map (SOM) algorithm (Dragut, Reference Dragut2012). They proposed an unsupervised, multi-scale streaming algorithm that identifies trends in evolving time series from streaming data, enabling the simulation of trading decisions on the fly.
To manage a portfolio of actively traded stocks, Rajput and Bobde (Reference Rajput and Bobde2016) investigated the selection and trading of 138 stocks from various sectors and distinct indices using the partitioning around medoids (PAM) clustering algorithm. Furthermore, they proposed a hybrid model for predicting stock value movement that employs opinion mining and clustering to forecast the National Stock Exchange (NSE). Their approach combined sentiment analysis outputs with DENCLUE clustering to predict stock market behavior.
The proposed hybrid model uses the values of technical indicators to produce a final prediction for each stock. A study in Renugadevi et al. (Reference Renugadevi, Ezhilarasie, Sujatha and Umamakeswari2016) provides investors with a short-term list of recommended stocks. Additionally, Bini and Mathew (Reference Bini and Mathew2016) suggest an analysis system that helps investors identify more profitable companies; it clusters companies and then predicts future prices for the identified profitable companies using regression techniques. The authors evaluate partitioning, hierarchical, model-based, and density-based techniques using validity indices such as the C-index, Jaccard index, Rand index (RI), and Silhouette index, and feed the clustering results into multiple regression techniques to predict future stock prices. Research in Lee et al. (Reference Lee, Lin, Kao and Chen2010) predicts short-term stock price movements following the release of financial reports using the proposed hierarchical recursive K-means (HRK) algorithm, which classifies stock time series based on similarity in price trends. Compared to a support vector machine (SVM), HRK achieves better accuracy and average profit in the predictive model. In Goswami et al. (Reference Goswami, Bhensdadia and Ganatra2009), a candlestick-analysis-based model is used to forecast short-term stock price fluctuations. Moreover, clustering has been employed as a preprocessing step for stock market prediction (Patil and Joshi, Reference Patil and Joshi2020).
Portfolio management for Indian stock data has been conducted using self-organizing maps (SOM) and K-means clustering (Nanda et al., Reference Nanda, Mahanty and Tiwari2010). SOM, a neural approach inspired by how the human brain organizes categories, consists of neurons forming a two-dimensional structure with neighborhood relationships between neuron pairs. Parameters are divided into short term (usually 1 to 30 days) and long term (usually three months to one year). The factors used include the price/earnings (P/E) ratio, price/book value (P/BV), price/cash EPS (P/CEPS), EV/EBITDA, and market cap/sales. The performance indices of the K-means, SOM, and fuzzy C-means algorithms are calculated and analyzed.
In Aghabozorgi and Teh (Reference Aghabozorgi and Teh2014), a novel three-phase clustering model has been proposed for categorizing companies based on similarities in the shapes of their stock series. Initially, low-resolution time series data are used for an approximate categorization of companies. In the second phase, the pre-clustered companies are further divided into several pure subclusters. Finally, the subclusters are merged in the third phase.
In Alemohammad (Reference Alemohammad2019), fuzzy clustering based on partitioning around medoids is employed to partition financial time series, with a GARCH parameterization approach measuring the similarity or dissimilarity between the series. The method is applied in Alemohammad (Reference Alemohammad2019) to cluster some major Asia-Pacific stock markets.
Furthermore, in Zhong and Enke (Reference Zhong and Enke2017), a comprehensive data mining process has been outlined to predict the daily return value of the S&P 500 Index ETF (SPY) based on 60 financial and economic features.
In Lúcio and Caiado (Reference Lúcio and Caiado2022), the impact of COVID-19 on certain S&P 500 industries is assessed using a novel feature-based clustering method with TGARCH modeling. Instead of using model-estimated parameters to compute a distance matrix for the stock indices, the approach estimates distances from the autocorrelations of the estimated conditional volatilities. Both hierarchical and nonhierarchical algorithms are used to assign the industries to clusters.
3. Description and preprocessing of research data
The data for this research were obtained from the Tehran Stock Exchange Technology Management Company website (http://www.tsetmc.com). These datasets encompass companies related to the cement industry and their factories. Initially, the data for cement companies listed on the Tehran Stock Exchange were extracted using the TseClient 2.0 software over a two-year period, from April 2021 to April 2023. Subsequently, companies with invalid or missing data (due to the absence of stock market activity on the examined dates or an insignificant number of trading days and invalid trading volumes) were excluded. The initial list included 85 companies, of which only 45 had valid and analyzable shares. The list of these companies is shown in Table 2.
The data utilized for this research from each company includes the following components:
• Opening price (OPEN)
• Highest price (HIGH)
• Lowest price (LOW)
• Closing price (CLOSE)
The data cover April 2021 to April 2023, recorded daily. Initially, we preprocessed the data, removing undefined, non-numeric, and missing values from the dataset. For each data entry, all features were acquired over the 24 months under study (from April 2021 to April 2023) and examined across various time intervals as follows:
Initially, for each day’s transactions of each listed company, the price change was calculated using the formula below and stored in one row of an array of price changes:

$ \mathrm{C}{\mathrm{o}}_{\mathrm{M}{\mathrm{o}}_{\mathrm{d}}}=\mathrm{Pric}{\mathrm{e}}_{\mathrm{Clos}{\mathrm{e}}_{\mathrm{d}}}-\mathrm{Pric}{\mathrm{e}}_{\mathrm{Ope}{\mathrm{n}}_{\mathrm{d}}} $

where Co represents the stock of the respective listed company and $ \mathrm{C}{\mathrm{o}}_{\mathrm{M}{\mathrm{o}}_{\mathrm{d}}} $ denotes the price change of that stock on day d; $ \mathrm{Pric}{\mathrm{e}}_{\mathrm{Clos}{\mathrm{e}}_{\mathrm{d}}} $ and $ \mathrm{Pric}{\mathrm{e}}_{\mathrm{Ope}{\mathrm{n}}_{\mathrm{d}}} $ are that stock’s closing and opening prices on day d, respectively. In addition, each company’s percentage price change was computed and stored for each day using the following formula:

$ \mathrm{C}{\mathrm{o}}_{\mathrm{PM}{\mathrm{o}}_{\mathrm{d}}}=\frac{\mathrm{Pric}{\mathrm{e}}_{\mathrm{Clos}{\mathrm{e}}_{\mathrm{d}}}-\mathrm{Pric}{\mathrm{e}}_{\mathrm{Ope}{\mathrm{n}}_{\mathrm{d}}}}{\mathrm{Pric}{\mathrm{e}}_{\mathrm{Ope}{\mathrm{n}}_{\mathrm{d}}}}\times 100 $

where $ \mathrm{C}{\mathrm{o}}_{\mathrm{PM}{\mathrm{o}}_{\mathrm{d}}} $ signifies the percentage change in the price of that stock on day d. These values were calculated for each listed company and stored in one array row per day.
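Under the reconstruction above (daily change as close minus open, and the percentage change taken relative to the open), the daily components are straightforward to compute; a minimal pandas sketch follows, in which the column names are illustrative rather than the actual TseClient export schema:

```python
import pandas as pd

# One row per trading day with OPEN and CLOSE columns
# (column names here are assumptions, not the real export format).
df = pd.DataFrame({"OPEN": [100.0, 104.0, 101.0],
                   "CLOSE": [104.0, 101.0, 103.0]})

df["change"] = df["CLOSE"] - df["OPEN"]                           # Co_Mo_d
df["pct_change"] = (df["CLOSE"] - df["OPEN"]) / df["OPEN"] * 100  # Co_PMo_d
```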
Based on the stock values of active companies in the cement sector, the percentage return and the deviation from the average price change are calculated and stored for five time windows, yielding two components per window. These components, denoted $ \mathrm{C}{\mathrm{o}}_{\mathrm{Recor}{\mathrm{d}}_{\mathrm{range}}} $ , represent the percentage price change and the deviation from the average price change within the historical range (a sketch follows the list below). The time windows consist of
- Weekly (5 working days)
- Monthly (21 working days)
- Three-month (63 working days)
- Six-month (126 working days)
- One-year (252 working days)
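Equation (4) itself is not reproduced in this text, so the aggregation below is an assumption for illustration: the window return is taken as the compounded daily percentage change, and the deviation as the standard deviation of the daily percentage changes within the window.

```python
import numpy as np

WINDOWS = {"weekly": 5, "monthly": 21, "quarterly": 63,
           "six_month": 126, "one_year": 252}

def window_components(pct_changes):
    """pct_changes: 1-D array of daily percentage price changes for one
    company. Returns {window: (return %, deviation)} over the most recent
    `days` trading days of each horizon (aggregation rules assumed)."""
    out = {}
    for name, days in WINDOWS.items():
        w = np.asarray(pct_changes)[-days:]
        total_return = (np.prod(1 + w / 100) - 1) * 100  # compounded return
        deviation = w.std()                              # spread of daily changes
        out[name] = (total_return, deviation)
    return out
```

Each company thus contributes one (return, deviation) pair per window, which is the two-dimensional feature vector fed to the clustering algorithms in Section 5.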
Following the preprocessing of the data, information from 45 samples (cement companies) will be fed into the researched algorithms. In the subsequent section, the evaluation metrics for this data will be introduced, along with explanations of how each is calculated.
4. Evaluation criteria
The main idea in assessing clustering is to compare the intra-cluster (within-cluster) distance to the intercluster (cluster-to-cluster) distance to determine how well clusters are separated. A good clustering should have a small intra-cluster distance and a large intercluster distance. For evaluating clustering, the following metrics are introduced:
4.1. Silhouette coefficient
As one of the most widely used clustering evaluation metrics, the silhouette coefficient (S) combines intra-cluster and intercluster distances into a score ranging from −1 to 1. A value of 1 signifies an excellent clustering result, where intercluster distances are much larger than intra-cluster ones. Conversely, a value close to −1 indicates a completely wrong assignment of clusters, where intercluster distances are not comparable to intra-cluster distances (Rousseeuw, Reference Rousseeuw1987). Regarding the intra-cluster distance, for each data point i inside cluster $ {C}_{\mathrm{I}} $ , a(i) is defined as the average distance between i and the other data points within $ {C}_{\mathrm{I}} $ :

$ a(i)=\frac{1}{\left|{C}_{\mathrm{I}}\right|-1}\sum_{j\in {C}_{\mathrm{I}},\;j\ne i}d\left(i,j\right) $

where $ \left|{C}_{\mathrm{I}}\right| $ is the number of points belonging to cluster $ {C}_{\mathrm{I}} $ and $ d\left(i,j\right) $ is the distance between data points i and j in cluster $ {C}_{\mathrm{I}} $ .
Thus, for any given point i, a small score a(i) indicates a good clustering assignment for point i because it is close to the points of the same cluster. In contrast, a large a(i) score indicates bad clustering for point i because it is far from its cluster points.
For the intercluster distance, for each data point i inside cluster $ {C}_{\mathrm{I}} $ , b(i) is defined as the smallest average distance from i to the points of any other cluster of which i is not a member; in other words, b(i) is the average distance between i and all points of its nearest neighboring cluster:

$ b(i)=\underset{K\ne \mathrm{I}}{\min}\frac{1}{\left|{C}_K\right|}\sum_{j\in {C}_K}d\left(i,j\right) $

After obtaining the average intra-cluster and intercluster distances for each data point in the dataset, the silhouette score is defined as follows:

$ s(i)=\frac{b(i)-a(i)}{\max \left\{a(i),b(i)\right\}} $
The silhouette score is set to zero in the rare situation where $ \left|{C}_{\mathrm{I}}\right|=1 $ (a cluster containing only the single data point i). From the formula above, the score is bounded between −1 and 1, with a larger score indicating better separation between clusters. One of the most significant advantages of the silhouette score is its straightforward interpretation and clear bounds. However, its major drawback is its computational cost: on relatively large datasets, its long execution time makes it less practical in real-world applications.
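In practice, the silhouette score is available directly in scikit-learn; a minimal usage sketch on synthetic data follows (the $ \mathcal{O}\left({n}^2\right) $ pairwise distances behind it are what make it slow on large datasets):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # in [-1, 1]; higher is better
```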
4.2. The Calinski–Harabasz index
The Calinski–Harabasz index (CH), also known as the variance ratio criterion, measures the ratio of the sum of squared distances between clusters to the sum of squared distances within clusters, each corrected by its degrees of freedom. Here, the within-cluster sum is based on the distances of the data points in a cluster to its centroid, and the between-cluster sum is based on the distances between the cluster centroids and the global centroid (Calinski and Harabasz, Reference Calinski and Harabasz1974). The CH for K clusters in the dataset D is defined as follows:

$ \mathrm{CH}=\frac{\sum_{k=1}^K{n}_k{\left\Vert {c}_k-c\right\Vert}^2/\left(K-1\right)}{\sum_{k=1}^K\sum_{i\in {C}_k}{\left\Vert {d}_i-{c}_k\right\Vert}^2/\left(N-K\right)} $
Here $ {d}_i $ represents the feature vector of data point i, $ {n}_k $ denotes the size of cluster k, $ {c}_k $ is the centroid of cluster k, c is the global centroid of the entire dataset, and N is the total number of data points. The numerator is a weighted sum (by cluster size $ {n}_k $ ) of the squared distances from each cluster centroid to the global centroid, divided by its degrees of freedom; the denominator is the sum of squared distances from each data point to its cluster centroid, divided by its degrees of freedom. The degrees of freedom place the two parts on a unified scale. A higher value indicates better separation between clusters, and, unlike the silhouette score, the metric has no upper bound.
4.3. The Davies–Bouldin index
The Davies–Bouldin index (DB) is similar to the CH index but, contrary to CH, places the intra-cluster distance in the numerator of its ratio. The computation of the DB uses a similarity score measuring the similarity between two clusters, defined as follows:

$ {R}_{ij}=\frac{S_i+{S}_j}{M_{ij}} $
where $ {R}_{ij} $ is the similarity score, $ {S}_i $ and $ {S}_j $ are the average distances from points to the centroid within clusters i and j, respectively, and $ {M}_{ij} $ is the distance between the centroids of clusters i and j. A lower similarity score indicates better cluster separation, as a small numerator implies small intra-cluster distances and a large denominator implies a large intercluster distance. The DB index is the average, over all clusters, of each cluster’s similarity score with its closest neighboring cluster (Davies and Bouldin, Reference Davies and Bouldin1979):

$ \mathrm{DB}=\frac{1}{K}\sum_{i=1}^K{D}_i,\kern1em {D}_i=\underset{j\ne i}{\max }{R}_{ij} $
In this equation, $ {D}_i $ is the worst (largest) similarity score of cluster i against all other clusters, and the final DB index is the average of the $ {D}_i $ values over the K clusters.
A smaller DB index signifies better cluster separation. However, this metric shares a weakness with the CH index: it does not handle clusters of arbitrary shape well (such as those produced by density-based clustering). Nevertheless, the CH and DB indices are much quicker to compute than the silhouette score.
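Both indices are likewise available in scikit-learn; a minimal sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print(calinski_harabasz_score(X, labels))  # higher is better; no upper bound
print(davies_bouldin_score(X, labels))     # lower is better; 0 is the ideal
```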
4.4. The RI and adjusted Rand index
The RI and adjusted Rand index (ARI) are metrics commonly used to compare clustering results against external criteria. Their calculation is somewhat more involved than the previous metrics, as it counts pairwise combinations. For N cases, the total number of unordered pairs, denoted C(N, 2), is calculated as follows:

$ C\left(N,2\right)=\frac{N\left(N-1\right)}{2} $
Please note that both C(1,2) = 0 and C(0,2) = 0.
In the contingency matrix, $ {V}_{ij} $ represents the count of elements present in the intersection between class partition $ {\mathrm{A}}_i $ from partition A and class partition $ {\mathrm{B}}_j $ from partition B.
When $ {V}_{ij}>0 $ , it indicates overlap between partitions. The sum of each row $ {S}_{i\ast } $ equals the count of elements in the class partition $ {\mathrm{A}}_i $ , and the sum of each column $ {S}_{\ast j} $ equals the count of elements in the class partition $ {\mathrm{B}}_j $ . The total sum $ {S}_{mn} $ equals the count of elements in set S.
The calculation of the RI and the ARI can be expressed in four values x, y, z, and w. In the standard pair-counting formulation, these are

$ x=\sum_{i,j}C\left({V}_{ij},2\right),\kern1em y=\sum_iC\left({S}_{i\ast },2\right)-x,\kern1em z=\sum_jC\left({S}_{\ast j},2\right)-x,\kern1em w=C\left({S}_{mn},2\right)-x-y-z, $

that is, the numbers of pairs grouped together in both partitions, together only in A, together only in B, and apart in both, with $ \mathrm{RI}=\left(x+w\right)/C\left({S}_{mn},2\right) $ .
Since the true category of each company is unknown in this clustering task, we use only the first three criteria of this section, that is, the silhouette, Calinski–Harabasz, and Davies–Bouldin indices, where better clustering corresponds to maximizing the first two and minimizing the third. The next section presents the clustering method and the results of each algorithm (Rand, Reference Rand1971; Hubert and Arabie, Reference Hubert and Arabie1985).
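Although these indices are not used in the experiments below, a small sketch shows the pair-counting quantities under the standard formulation above, cross-checked against scikit-learn’s ARI:

```python
import numpy as np
from scipy.special import comb
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

A = np.array([0, 0, 1, 1, 2, 2])   # partition A (e.g., external labels)
B = np.array([0, 0, 1, 2, 2, 2])   # partition B (e.g., clustering result)

V = contingency_matrix(A, B)                 # V_ij of the text
x = comb(V, 2).sum()                         # together in both partitions
y = comb(V.sum(axis=1), 2).sum() - x         # together only in A
z = comb(V.sum(axis=0), 2).sum() - x         # together only in B
w = comb(len(A), 2) - x - y - z              # apart in both
print("RI  =", (x + w) / comb(len(A), 2))
print("ARI =", adjusted_rand_score(A, B))
```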
5. Experimental results and comparison with other algorithms
In this section, the stock values of listed companies active in the cement sector are clustered and analyzed over short-term and long-term intervals for two components: the percentage return and the deviation from the average price change. We use the K-means, GOPC, IPC, and FDPC algorithms, prominent algorithms in data clustering, as baselines for the FPC algorithm; the clustering results of FPC and the compared algorithms are then reviewed against the clustering evaluation criteria. Each listed company is described by its percentage price change and its deviation from the average price change, so the clustering results can be displayed in two dimensions.
5.1. Review of listed companies in the short term
This section focuses on clustering cement company stocks over short-term windows, specifically weekly and monthly. To initiate the clustering process, we first need to determine the number of clusters (K). Since we lack prior knowledge of this value, we run the K-means algorithm with the number of clusters ranging from 2 to 20 and calculate the silhouette metric for each case; the optimal cluster count is the one with the highest silhouette score. Given the stochastic nature of K-means, this process is executed 30 times, and the most prevalent value among the outcomes serves as the cluster count (K) for further analysis.
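A sketch of the described selection procedure follows (synthetic stand-in data; the real inputs are the 45 companies’ two-dimensional feature vectors):

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=45, centers=3, random_state=1)  # stand-in data

winners = []
for run in range(30):                      # 30 repetitions, per the text
    scores = {}
    for k in range(2, 21):                 # candidate cluster counts 2..20
        labels = KMeans(n_clusters=k, n_init=1,
                        random_state=run).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    winners.append(max(scores, key=scores.get))  # best K of this run

K = Counter(winners).most_common(1)[0][0]  # most prevalent optimum
print("chosen K:", K)
```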
5.1.1. Weekly interval
Diagram 1 shows the result obtained from running the algorithm on stock market data within the weekly time frame, as specified:
Diagram 1 shows that within this time frame, the optimal number of clusters for the cement-sector stock companies is two. Following this stage, we execute the FPC and rival algorithms with the obtained K, considering the weekly interval in equation (4). Figure 1 illustrates the clustering outcomes of the stock market companies within the weekly interval using the compared algorithms.
Based on the results of Figure 1, it can be said that the points in the bottom-left represent stock companies with higher negative profit percentages and lower price fluctuations in the weekly interval compared to other companies. Similarly, the points in the top-right indicate companies with higher positive profit percentages and more significant price fluctuations in the weekly interval than other companies. The evaluation results of three different clustering metrics are presented in Table 3.
The results show that the FPC method, similar to the K-means algorithm, provides better clustering in the first two metrics than other algorithms. In the third metric, the FDPC algorithm has the lowest value. However, given that its values in the first two metrics are considerably lower than those of other algorithms, its DB index value becomes unreliable. Consequently, both the K-means and FPC methods exhibit better performance in this experiment, indicating the creation of higher-quality clusters.
The results show that the cement-industry stock companies in the short-term weekly interval can be divided into two categories. Each category has a mean or cluster center; in other words, one company represents each cluster, as presented in Table 4.
Considering the results in Table 4 and Figure 1, focusing on the stocks in Cluster 2 could be beneficial for short-term and daily investments, because all companies within this cluster demonstrate positive returns suitable for short-term, specifically weekly, investment.
5.1.2. One-month interval
Diagram 2 shows the results obtained from running the algorithm on stock market data for the one-month interval, with the specified attributes:
Diagram 2 shows that seven clusters are the most suitable number for clustering the cement companies in this time frame. Following this stage, we execute the FPC algorithm and its competitors with the obtained K value, considering the one-month interval in equation (4). Figure 2 illustrates the clustering results of the stock market companies for the one-month interval using the compared algorithms.
Based on the results shown in Figure 2, the points in the bottom-left section represent stock market companies with higher negative profit percentages and fewer price variations within the one-month interval than other companies. Similarly, the points in the top-right section indicate companies with higher positive profit percentages and more significant price changes within the one-month interval than the rest. The evaluation results of three different clustering metrics are presented in Table 5.
The outcomes show that the FPC method demonstrates better clustering in the first two metrics compared to the other algorithms. Furthermore, GOPC, IPC, and FDPC algorithms exhibit the lowest values in the third metric. However, given their substantially lower values in the first two metrics than the other algorithms, their DB metric value becomes invalid. Following these assessments, the FPC algorithm presents the most favorable outcome. Therefore, in this experiment, the FPC method generates higher-quality clusters.
The results show that the cement-related stock market companies within the short-term one-month interval can be classified into seven categories. Each category possesses an average or cluster center; in other words, one company represents each cluster, as displayed in Table 6.
Based on the findings from Table 6 and Figure 2, it can be observed that focusing on stocks belonging to clusters 1, 4, and 5 would be advisable for short-term investments within the one-month interval. This recommendation stems from the fact that in these clusters, all stock market companies demonstrate positive returns and are suitable for short-term investments within one month. The companies falling into clusters 3 and 6 maintain a balanced outcome, signifying relatively moderate profits and customary fluctuations within the one-month interval, positioning them in an intermediate group. Finally, companies placed in Clusters 2 and 7 are not recommended for one-month investment due to their specific characteristics.
5.2. Long-term review of the listed companies
In this section, we will focus on clustering the stocks of cement companies in the long term, spanning over three-month, six-month, and one-year time frames. To conduct this analysis, we must determine the number of clusters (K) for the clustering process. Since the optimal number of clusters is unknown, we will employ the K-means algorithm with varying clusters from 2 to 20, calculating the silhouette score for each case. The higher silhouette score will determine the optimal number of clusters. As the K-means algorithm possesses a random structure, we will repeat this process 30 times, considering the predominant value among the generated scenarios as the reference for determining the number of clusters (K) for further analysis.
5.2.1. Three-month interval
Diagram 3 shows the result of running the algorithm on the stock market data in an interval of three months with the mentioned specifications.
Diagram 3 shows that in this time interval, four clusters is the best number for clustering the cement companies. After this step, we run the FPC and competing algorithms with the obtained K, considering the three-month interval in equation (4). Figure 3 shows the results of clustering the stock companies over three months with the compared algorithms.
According to the results of Figure 3, the points in the lower left represent the listed companies with a higher negative profit percentage and smaller price changes in the three-month interval than other companies; similarly, the points in the upper right represent the companies with a higher positive profit percentage and larger price changes in the three-month interval. The evaluation results of the three clustering criteria are shown in Table 7.
As the results show, the FPC method achieves better clustering in the first two criteria than the other algorithms. In the third criterion, the IPC and GOPC algorithms have the lowest value; however, given that their values in the first two criteria are considerably lower than those of the other algorithms, it can be concluded that the FPC method creates higher-quality clusters in this test.
Analyzing the results, we conclude that the listed cement-industry companies can be divided into four categories over the three-month interval. Each category has a mean or cluster center; in other words, one company represents each cluster, as shown in Table 8.
Given the results in Table 8 and Figure 3, investments over a three-month horizon could focus on the shares of the companies in Cluster 3 because, in this cluster, all the listed companies have positive returns and are suitable for quarterly investment. The listed companies in Clusters 1 and 2 have balanced results, with relative profitability and conventional changes in the three-month interval, and fall into the middle group. Finally, the companies in Cluster 4 are not recommended for quarterly investment.
5.2.2. Six-month interval
Diagram 4 shows the result of running the algorithm on the stock market data in a six-month interval with the mentioned specifications.
Diagram 4 shows that in this time interval, three clusters is the best number for clustering the listed cement companies. After this step, we run the FPC algorithm and the competing algorithms with the obtained K, considering the 126-day interval in equation (4). Figure 4 shows the results of clustering the stock companies in the six-month interval with the compared algorithms.
According to the results of Figure 4, the points in the lower left represent the listed companies with a higher negative profit percentage and smaller price changes in the six-month interval than other companies. Similarly, the points in the upper right represent companies with a higher positive profit percentage and larger price changes in the six-month interval. The evaluation results of the three clustering criteria are given in Table 9.
As can be seen from the results of this section, the FPC and K-means methods achieve better clustering in all three criteria than other algorithms. After analyzing the results of this section, it can be concluded that in this experiment, FPC and K-means create better quality clusters.
Analyzing the results, we conclude that the listed cement-industry companies can be divided into three categories over the six-month interval. Each category has a mean or cluster center; in other words, one company represents each cluster, as shown in Table 10.
According to the results of Table 10 and Figure 4, a long-term investment over six months could focus on the shares of the companies in Cluster 1 because, in this cluster, all stock companies have positive returns and are suitable for six-month investment. The stock exchange companies in Cluster 2 have balanced results, with relative profitability and conventional changes in the six-month interval, and fall into the middle group. Finally, the companies in Cluster 3 are not recommended for six-month investment.
5.2.3. One-year interval
Diagram 5 shows the result of running the algorithm on the stock market data in a one-year interval with the mentioned specifications.
Diagram 5 shows that in this time interval, three clusters is the best number for clustering the listed cement companies. After this step, we run the FPC algorithm and the competing algorithms with the obtained K, considering the one-year interval in equation (4). Figure 5 shows the results of clustering the listed companies in a one-year interval with the compared algorithms.
According to the results of Figure 5, the points in the lower left represent the listed companies with a higher negative profit percentage and smaller price changes in the one-year interval than other companies; similarly, the points in the upper right represent companies with a higher positive profit percentage and larger price changes in the one-year interval. The evaluation results of the three clustering criteria are shown in Table 11.
As the results of this section show, the FPC, FDPC, and K-means methods achieve better clustering in all three criteria than the IPC and GOPC algorithms, so in this experiment they create better quality clusters. Based on the analysis, cement-related companies in the stock market can be categorized over the long-term one-year interval into three distinct clusters, each with an average or central point represented by a company within that cluster, as shown in Table 12.
From the findings in Table 12 and Figure 5, long-term investments within a one-year time frame could focus on the stocks in Cluster 2, which includes companies exhibiting positive returns suitable for longer-term strategies. Companies in Cluster 1 exhibit balanced performance, with relatively moderate profitability and standard fluctuations within the one-year interval, positioning them in the middle group. Finally, companies in Cluster 3 are not recommended for one-year investments.
6. Discussion
Table 13 shows the listed companies and their cluster numbers for each interval.
According to Table 13, the companies in each cluster can be checked for the time intervals evaluated in this article.
7. Conclusion
This study clustered and evaluated listed companies active in the cement sector. In this regard, 45 cement companies were evaluated on capital return and deviation from the average price change over short-term (weekly and one-month) and long-term (three-month, six-month, and one-year) intervals. The K-means, IPC, GOPC, and FDPC clustering algorithms were used to cluster these data, and their results were compared with the FPC algorithm. The clustering results and evaluations indicated that the FPC algorithm produced better quality clusters in all cases and showed acceptable performance across the different clustering experiments. In addition, by examining the profitability and price changes of the listed companies, the representative company of each cluster was identified in each time interval.
Tables 4, 6, 8, 10, and 12 identify which company represents each cluster among all the cement-sector companies in the stock market, and the companies suitable and unsuitable for investment in each interval were introduced. According to Table 13, these cluster assignments can be reviewed per company and considered in short-term and long-term investment decisions.
Data availability statement
The data for this research were obtained from the Tehran Stock Exchange Technology Management Company website (https://www.khanehsiman.com, available August 24, 2024).
Author contribution
All of the authors have participated in all parts of the manuscript.
Competing interest
The authors declare no conflicts of interest to report regarding the present study.
Ethical approval
This article does not contain any studies with human participants or animals performed by authors.