Impact Statement
This research applied the FPC algorithm and other prominent clustering algorithms to stock market data. The analysis extracts the stock market data of cement companies listed on the Tehran Stock Exchange over the past two years and categorizes these companies based on profitability percentage and on short-term and long-term price fluctuations. The results demonstrate that companies exhibiting profitability, stability, or loss over short-term and long-term periods are placed within their respective clusters.
1. Introduction
Financial markets play a crucial role in driving economic growth. Temporal data analysis in financial markets, such as the analysis of changes in stock values, helps investors and financial institutions guide their capital toward optimal returns. In simple terms, a transaction is any buying or selling of company shares based on their price fluctuations, aiming to achieve maximum returns. To trade intelligently, each trader requires a sound understanding of market behavior, company performance, and investment strategies.
To achieve positive returns, traders must make informed decisions regarding the stocks they trade. Aside from price and liquidity, volatility is among the most critical factors in stock selection. The term “volatility” refers to the upward and downward movements in stock prices. Higher volatility implies that a stock’s price undergoes significant changes in either direction within a short interval, while stocks with minimal price fluctuations are unlikely to rise or fall significantly on any given day. Factors such as trading volume, news and financial reports, political events, economic conditions, and investor sentiment also affect stock prices. The efficient market hypothesis (EMH) posits that current stock prices fully reflect all available information in the market at any given moment, making it impossible for individuals to consistently make successful trades based on that information. However, some argue that markets are inefficient (partly due to the psychological factors of different market participants, coupled with the market’s failure to respond immediately to newly released information) (Jensen, Reference Jensen1978). Under this view, financial variables such as stock prices, stock market index values, and prices of financial derivatives are predictable, allowing individuals to achieve returns above the market average, better than random chance, by analyzing publicly available information. Stock markets are influenced by numerous interrelated factors, including (Lo and MacKinlay, Reference Lo and MacKinlay1988)
1) Economic variables: interest rates, exchange rates, monetary growth rates, commodity prices, and overall economic conditions.
2) Industry-specific variables: industrial production growth rates and consumer price indices.
3) Company-specific variables: changes in corporate policies, earnings reports, and stock returns.
4) Psychological variables of investors: investor expectations and institutional investor choices.
5) Political variables: significant political events and their dissemination (Enke and Thawornwong, Reference Enke and Thawornwong2005; Wang et al., Reference Wang, Wang, Zhang and Guo2011).
Each of these factors interacts with the others in a highly complex manner (Yao et al., Reference Yao, Tan and Poh1999; Kang et al., Reference Kang, de Gracia and Ratti2017; Antonakakis et al., Reference Antonakakis, Cunado, Filis, Gabauer and de Gracia2023). Previous studies have also concluded that the stock market is fundamentally dynamic, nonlinear, nonstationary, nonparametric, noisy, and turbulent (Deboeck, Reference Deboeck1994; Abu-Mostafa and Atiya, Reference Abu-Mostafa and Atiya1996). Nonetheless, by refining trading strategies based on more accurate predictions of financial variables, investors hope to profit from potential market inefficiencies. Traditionally, stocks with higher volatility are perceived as riskier; however, volatility traders often seek highly fluctuating stocks in anticipation of higher returns. Analyzing stock market movements therefore remains challenging and fascinating for both investors and researchers.
This study aims to support investment decisions by analyzing returns and price fluctuations in Tehran Stock Exchange data, focusing on cement companies listed on the exchange. The analysis begins by extracting these companies’ stock market data over the past two years. These companies are then categorized based on profitability percentage and on short-term and long-term price fluctuations, using renowned clustering algorithms from this domain. Throughout this process, these algorithms are compared with the algorithm presented in our previous paper (the FPC clustering algorithm), and the clustering results are evaluated against one another using standard, recognized criteria for clustering quality. Finally, the clustering results are examined and summarized, and stocks are assessed for returns and price fluctuations based on them. The study is organized as follows:
The subsequent section addresses the studies carried out in data and stock market clustering, presenting the undertaken research in each domain. Section 3 is dedicated to the required data for examination and analysis, outlining the extraction methodology of this data within a two-year time frame (from April 2021 to April 2023). Section 4 comprises the introduction of evaluation criteria and the comparison of clustering algorithms under consideration. Additionally, Section 5 includes empirical results of clustering Tehran Stock Exchange data for cement companies and compares these results with the algorithms discussed in this study. Finally, the concluding section consists of a summary and conclusions from the study.
2. Previous studies
In this section, we examine the studies conducted in two areas: data clustering and the clustering of stock market companies. Given the study’s focus on clustering active stock companies within the cement sector, the first part presents research on clustering algorithms, and the second part presents research on clustering listed companies.
2.1. Research in the field of clustering algorithms
One of the path-based clustering methods uses the minimum spanning tree (MST), initially presented by Zahn (Reference Zahn1971). In this approach, the MST of a weighted graph is first constructed, and incompatible edges are then removed. Ideally, these incompatible edges are the longest edges in the graph. However, this assumption is often violated, and eliminating such edges does not yield suitable clusters. In recent years, extensive research has focused on introducing a metric function for identifying these incompatible edges (Wang et al., Reference Wang, Wang, Chen and Wilkes2013; Wu et al., Reference Wu, Li, Jiao, Wang and Sun2013; Pirim et al., Reference Pirim, Eksioglu and Perkins2015). Moreover, the computational complexity of the proposed function is a serious consideration, and some of these algorithms specifically address computational complexity in designing this decision tree (Jothi et al., Reference Jothi, Mohanty and Ojha2018; Wang et al., Reference Wang, Wang and Wilkes2009). Additionally, Laszlo and Mukherjee (Reference Laszlo and Mukherjee2005) defined a constraint on cluster size, treating a candidate edge as a link between clusters whose removal should create two clusters larger than a specified threshold. To mitigate sensitivity to scattered data points in clustering, Zhong et al. proposed a two-stage MST-based clustering algorithm (Zhong et al., Reference Zhong, Miao and Wang2010); however, their algorithm exhibits high complexity because it requires numerous parameters. Subsequently, Zhong et al. introduced a hierarchical clustering algorithm called split-and-merge (Zhong et al., Reference Zhong, Miao and Fränti2011), which uses the MST to guide the separation and merging process. The issue with this algorithm lies in its high time complexity; optimized computation is crucial, particularly when clustering large databases. Therefore, Wang et al. introduced an MST-inspired fast clustering algorithm that operates with a computational complexity of $ \mathcal{O}\left({n}^2\right) $ .
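For concreteness, the following is a minimal sketch of the naive Zahn-style edge-removal idea that the works above refine: build the MST of the complete Euclidean graph, delete the k − 1 longest edges (the naive notion of “incompatible” edges), and take the remaining connected components as clusters. This is an illustration only, not a reimplementation of any cited algorithm.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_cluster(X, k):
    """Naive Zahn-style MST clustering: cut the k - 1 longest MST edges
    and label each resulting connected component as one cluster."""
    dist = squareform(pdist(X))                  # full pairwise distance matrix
    mst = minimum_spanning_tree(dist).toarray()  # n - 1 tree edges
    edges = np.argwhere(mst > 0)                 # (i, j) index of each edge
    weights = mst[mst > 0]                       # weights, same row-major order
    for i, j in edges[np.argsort(weights)[::-1][: k - 1]]:
        mst[i, j] = 0.0                          # cut one of the longest edges
    _, labels = connected_components(mst, directed=False)
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(1, 0.1, (30, 2))])
print(mst_cluster(X, k=2))
```

As the text notes, the longest-edge assumption often fails in practice (e.g., with chained or noisy data), which is precisely what the cited metric functions try to correct.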
Alongside the algorithms above, which perform clustering by eliminating incompatible edges from the MST, other algorithms use the MST to compute dissimilarity between pairs of data points, such as the minimax distance proposed by Fischer and Buhmann (Reference Fischer and Buhmann2003), defined as follows:

$ D\left(i,j\right)=\underset{p\in {P}_{ij}}{\min}\;\underset{\left(a,b\right)\in p}{\max }\;d\left(a,b\right), $

where $ {P}_{ij} $ is the set of all paths connecting points i and j and $ d\left(a,b\right) $ is the distance between adjacent points a and b along a path. In other words, the minimax distance between two points is the smallest possible value, over all connecting paths, of the longest edge on the path.
However, their algorithm cannot provide an optimal value for all datasets. A significant challenge in this field is how to partition objects based on the similarity matrix, because the optimization problem underlying these types of clustering is NP-hard.
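The minimax distance itself can be computed exactly on the MST, since a path minimizing its largest edge always lies on the tree; the following sketch illustrates an $ \mathcal{O}\left({n}^2\right) $ all-pairs computation based on that standard fact (an illustration, not the cited authors’ implementation):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def minimax_distances(X):
    """All-pairs minimax distance: D[i, j] is the smallest possible value,
    over paths from i to j, of the largest edge on the path. The optimal
    path lies on the MST, so one tree traversal per source node suffices."""
    n = len(X)
    mst = minimum_spanning_tree(squareform(pdist(X))).toarray()
    adj = np.maximum(mst, mst.T)                 # symmetrize the tree
    D = np.zeros((n, n))
    for s in range(n):                           # DFS from each source node
        stack, seen = [s], {s}
        while stack:
            u = stack.pop()
            for v in np.nonzero(adj[u])[0]:
                if v not in seen:
                    seen.add(v)
                    # largest edge on the tree path s -> v
                    D[s, v] = max(D[s, u], adj[u, v])
                    stack.append(v)
    return D

X = np.random.rand(20, 2)
D = minimax_distances(X)
```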
Meanwhile, the K-means algorithm remains one of the most widely used clustering algorithms (Jain, Reference Jain2010). Its popularity stems from its efficiency, simplicity, and acceptable results in practical applications. However, K-means performs well only on compact, Gaussian-shaped clusters and struggles with elongated or nonlinearly distributed clusters (Dhillon et al., Reference Dhillon, Guan and Kulis2004). To address this issue, kernel K-means was introduced (Von Luxburg, Reference Von Luxburg2007). This method projects data into a higher-dimensional feature space defined by a nonlinear function so that the data become linearly separable. However, selecting an appropriate kernel function and its parameters for a given dataset can be challenging.
A more recent approach is spectral clustering, which clusters data using the eigenvectors of a matrix derived from the input data. Results show that spectral clustering does not face the issues that traditional clustering algorithms like K-means encounter (Von Luxburg, Reference Von Luxburg2007). Nevertheless, spectral clustering confronts the user with many adjustable choices, such as the similarity metric and its parameters, the type of graph Laplacian matrix, and the number of eigenvectors used (Von Luxburg, Reference Von Luxburg2007). Unfortunately, the success of spectral clustering depends heavily on these choices, making the algorithm difficult to use effectively.
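As an illustration of those choices in practice, the following is a minimal scikit-learn sketch; the dataset and parameter values are arbitrary, and the affinity, its gamma parameter, and the cluster count are exactly the knobs described above:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two interleaved half-moons: nonlinearly separable, hard for plain K-means.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Each argument below is one of the "adjustable choices" the user faces:
# the affinity (here an RBF kernel), its gamma parameter, and n_clusters.
labels = SpectralClustering(
    n_clusters=2, affinity="rbf", gamma=15.0, random_state=0
).fit_predict(X)
```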
Chang and Yeung proposed a combined algorithm that simultaneously utilizes path-based and spectral clustering, referred to as RPSC (Chang and Yeung, Reference Chang and Yeung2008). However, their algorithm has a very high computational complexity of $ \mathcal{O}\left({n}^3\right) $ due to the eigenvector computations involved.
On the other hand, Ester et al. introduced a density-based clustering algorithm called DBSCAN (Ester et al., Reference Ester, Kriegel, Sander and Xu1996), which can identify clusters with arbitrary shapes. This algorithm can cluster datasets in $ \mathcal{O}\left({n}^2\right) $ time. Nonetheless, one of its challenges is requiring the user to specify several input parameters. Finding suitable parameters in some datasets can be exceedingly difficult.
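A minimal scikit-learn sketch of a DBSCAN call follows; eps (the neighborhood radius) and min_samples (the density threshold) are the input parameters whose tuning the text describes as difficult, and the values below are arbitrary:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps and min_samples are the user-supplied parameters; finding suitable
# values for a given dataset is the difficulty noted above.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # -1 marks points DBSCAN treats as noise
```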
Rodriguez and Laio (Reference Rodriguez and Laio2014) introduced an algorithm named find density peaks clustering (FDPC), capable of clustering data with a time complexity of $ \mathcal{O}\left({n}^2\right) $ . This method identifies cluster centers based on point density: points with higher density than their neighbors that also lie relatively far from any point of higher density are considered candidate centers. The algorithm takes a cutoff parameter, denoted $ {d}_{\mathrm{c}} $ , as input, determines the number of clusters using a decision graph, and performs the clustering accordingly. However, FDPC may not cluster datasets with irregular distributions efficiently and can only identify clusters with a distinct center.
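The two quantities at the heart of the method can be sketched in a few lines of numpy: each point receives a local density ρ (the number of neighbors within the cutoff $ {d}_{\mathrm{c}} $ ) and a distance δ to the nearest point of higher density; points where both are large are the candidate centers. This is a simplified sketch, not the published implementation:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def density_peaks_scores(X, d_c):
    """Return (rho, delta): rho is the number of neighbors within d_c;
    delta is the distance to the nearest point of strictly higher density
    (for the densest point, its largest distance to any other point)."""
    d = squareform(pdist(X))
    rho = (d < d_c).sum(axis=1) - 1          # exclude the point itself
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]
        delta[i] = d[i, higher].min() if higher.any() else d[i].max()
    return rho, delta
```

Plotting δ against ρ gives the decision graph from which the number of clusters is read off; points isolated in the upper-right corner are the centers.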
Normalized cut spectral clustering (NCut) is one of the clustering algorithms that performs exceptionally well on certain types of clusters, including nonspherical and elongated ones (Ng et al., Reference Ng, Jordan and Weiss2002; Zang et al., Reference Zang, Jiang and Ren2017). Its accuracy relies on the affinity matrix. Most spectral clustering algorithms use a Gaussian kernel function to estimate similarity; however, determining an optimal value for the $ \unicode{x03C3} $ parameter of the Gaussian kernel can pose a challenge for users.
Among traditional MST-based algorithms, one of the fastest is FAST-MST, with a time complexity of $ \mathcal{O}\left({n}^{3/2}\;\log (n)\right) $ (Jothi et al., Reference Jothi, Mohanty and Ojha2018). However, this algorithm is susceptible to noise, as it treats an outlier, a point significantly distant from all others, as a separate cluster. This is one of the most recognized problems in MST-based clustering algorithms.
Another path-based clustering algorithm is the IPC algorithm introduced by Liu et al. (Reference Liu, Zhang, Hu, Wang, Wang and Zhao2019). Utilizing Euclidean distance for distance feature extraction among elements, this algorithm incorporates MSTs and minimax distance to derive a global optimum value with time complexity of $ \mathcal{O}\left({n}^2\right) $ .
An additional path-based clustering algorithm is global optimum path-based clustering (GOPC) (Liu and Zhang, Reference Liu and Zhang2019). This algorithm requires only one input parameter (the number of clusters) for execution. This parameter can be specified by the user or estimated by the algorithm. The time complexity of this algorithm is faster than $ \mathcal{O}\left(k\times {n}^2\right) $ , indicating a quicker execution than RPSC.
In Safari-Monjeghtapeh and Esmaeilpour (Reference Safari-Monjeghtapeh and Esmaeilpour2024), a path-based clustering algorithm called FPC has been introduced, which performs clustering using MSTs and the minimax distance and presents a hybrid metaheuristic that requires only a few iterations. The algorithm achieves a global optimum value with a computational complexity of $ \mathcal{O}\left({n}^2\right) $. FPC handles datasets of various shapes, sizes, and densities acceptably and is resilient to noise and outliers. Furthermore, FPC requires only one parameter, the number of clusters, for executing the clustering process. A comparative analysis with other prominent clustering algorithms shows that FPC delivers better, more stable, and faster performance than the alternatives in this field.
Considering the algorithms reviewed above and the clustering results presented in Safari-Monjeghtapeh and Esmaeilpour (Reference Safari-Monjeghtapeh and Esmaeilpour2024), Table 1 compares the algorithms used in this article in terms of input parameters, sensitivity to initial values, time complexity, and the ability to detect the number of clusters automatically.
Note to Table 1: FDPC determines the number of clusters using a decision graph.
2.2. The studies in the field of clustering of listed companies
Given its unpredictability, investing in the stock market is challenging. Therefore, stock fluctuation data need examination to gain insights into market trends and behaviors. Many regression analysis tools are available to guide investors in making informed trading decisions. Alongside regression techniques, classification and clustering methods are employed to identify market trends and behaviors.
Numerous studies have been conducted in recent years on stock market data analysis using clustering techniques. In most of these studies, moment-to-moment stock market data are collected, clustering is performed using various parameters, and the obtained clusters are then compared with existing standard market data. Analyzing and clustering online stock market data with data mining tools is one of the key methods in this domain. Data mining can be defined as an analytical method designed to discover patterns within data. Data mining algorithms analyze patterns and relationships among stored data values and then attempt to apply these patterns to new subsets of data (Guha et al., Reference Guha, Rastogi and Shim2001). Tasks in this area are primarily divided into two main parts:
1) Prediction, which aims to predict the values of required features based on the values of other features.
2) Description, which aims to extract patterns such as correlations, trends, clusters, paths, and anomalies that summarize hidden relationships within the data. Factors used in this context include the price/earnings (P/E) ratio and earnings per share (EPS) (Setty et al., Reference Setty, Rangaswamy and Suresh2010).
In previous studies, researchers have compared the performance of various machine learning algorithms trained on trade-related information, aiming to identify the relevant stock segments in which decision support for stock analysis and suitable presentation of transactional data are possible (Suganthi and Kamalakannan, Reference Suganthi and Kamalakannan2015).
Clustering aims to group data objects with similar characteristics into clusters. Joseph and Indratmo (Reference Joseph and Indratmo2013) clustered stock data based on similar price movement patterns using the self-organizing map (SOM) algorithm (Dragut, Reference Dragut2012). They proposed an unsupervised, multi-scale streaming algorithm that identifies trends in evolving time series from streaming data, enabling the simulation of trading decisions on the fly.
To manage a portfolio of actively traded stocks, Rajput and Bobde (Reference Rajput and Bobde2016) investigated the selection and trading of 138 stocks from various sectors and distinct indices using the partitioning around medoids (PAM) clustering algorithm. Furthermore, they proposed a hybrid model for predicting stock value movement that employs opinion mining and clustering to forecast the National Stock Exchange (NSE). Their approach combined sentiment analysis outputs with DENCLUE clustering to predict stock market behavior.
The proposed hybrid model uses the values of technical indicators to produce a final prediction for each stock. A study in Renugadevi et al. (Reference Renugadevi, Ezhilarasie, Sujatha and Umamakeswari2016) provides investors with a short-term list of recommended stocks. Additionally, Bini and Mathew (Reference Bini and Mathew2016) suggest an analysis system that helps investors identify more profitable companies; it clusters companies and then predicts future prices for the identified profitable companies using regression techniques. The authors evaluate partitioning, hierarchical, model-based, and density-based techniques using validity indices such as the C-index, Jaccard index, Rand index (RI), and Silhouette index, and feed the clustering results into multiple regression techniques to predict future stock prices. Research in Lee et al. (Reference Lee, Lin, Kao and Chen2010) predicts short-term stock price movements following the release of financial reports using the proposed hierarchical recursive K-means (HRK) algorithm, which classifies stock time series based on similarity in price trends. Compared to a support vector machine (SVM), HRK achieves better accuracy and average profit in the predictive model. In Goswami et al. (Reference Goswami, Bhensdadia and Ganatra2009), a candlestick-analysis-based model is used to forecast short-term stock price fluctuations. Moreover, clustering has been employed as a preprocessing step for stock market prediction (Patil and Joshi, Reference Patil and Joshi2020).
Portfolio management for Indian stock data has been conducted using self-organizing maps (SOM) and K-means clustering (Nanda et al., Reference Nanda, Mahanty and Tiwari2010). SOM, a neural approach inspired by how the human brain organizes categories, consists of neurons forming a two-dimensional structure with neighborhood relationships between neuron pairs. Parameters are divided into short term (usually 1 to 30 days) and long term (usually three months to one year). The factors used include the price/earnings (P/E) ratio, price/book value (P/BV), price/cash EPS (P/CEPS), EV/EBITDA, and market cap/sales. The performance indices of the K-means, SOM, and fuzzy C-means algorithms are calculated and analyzed.
In Aghabozorgi and Teh (Reference Aghabozorgi and Teh2014), a novel three-phase clustering model has been proposed for categorizing companies based on similarities in the shapes of their stock series. Initially, low-resolution time series data are used for an approximate categorization of companies. In the second phase, the pre-clustered companies are further divided into several pure subclusters. Finally, the subclusters are merged in the third phase.
In Alemohammad (Reference Alemohammad2019), fuzzy clustering based on partitioning around medoids is employed to partition financial time series, with a GARCH parameterization approach measuring the similarity or dissimilarity between the series. The method is applied in Alemohammad (Reference Alemohammad2019) to cluster some major Asia-Pacific stock markets.
Furthermore, in Zhong and Enke (Reference Zhong and Enke2017), a comprehensive data mining process has been outlined to predict the daily return value of the S&P 500 Index ETF (SPY) based on 60 financial and economic features.
In Lúcio and Caiado (Reference Lúcio and Caiado2022), the impact of COVID-19 on certain S&P 500 industries is assessed using a novel feature-based clustering method with TGARCH modeling. Instead of using model-estimated parameters to compute a distance matrix for the stock indices, the approach estimates distances from the autocorrelations of the estimated conditional volatilities. Both hierarchical and nonhierarchical algorithms are used to assign the industries to clusters.
3. Description and preprocessing of research data
The data for this research were obtained from the Tehran Stock Exchange Technology Management Company website (http://www.tsetmc.com). These datasets encompass companies related to the cement industry and their factories. Initially, the data for cement companies listed on the Tehran Stock Exchange were extracted using the TseClient 2.0 software over a two-year period, from April 2021 to April 2023. Subsequently, companies with invalid or missing data (due to the absence of stock market activity on the examined dates or an insignificant number of trading days and invalid trading volumes) were excluded. The initial list included 85 companies, of which only 45 had valid and analyzable shares. The list of these companies is shown in Table 2.
The data utilized for this research from each company includes the following components:
• Opening price (OPEN)
• Highest price (HIGH)
• Lowest price (LOW)
• Closing price (CLOSE)
The data cover April 2021 to April 2023, recorded daily. Initially, we preprocessed the data, removing undefined, non-numeric, and missing values from the dataset. For each data entry, all features were acquired over the 24 months under study (from April 2021 to April 2023) and examined across various time intervals as follows:
Initially, for each day’s transactions of each listed company, the price change was calculated using the formula below and stored in one row of an array of price changes:

$ \mathrm{C}{\mathrm{o}}_{\mathrm{M}{\mathrm{o}}_{\mathrm{d}}}=\mathrm{Pric}{\mathrm{e}}_{\mathrm{Clos}{\mathrm{e}}_{\mathrm{d}}}-\mathrm{Pric}{\mathrm{e}}_{\mathrm{Ope}{\mathrm{n}}_{\mathrm{d}}} $

where Co represents the stock of the respective listed company and $ \mathrm{C}{\mathrm{o}}_{\mathrm{M}{\mathrm{o}}_{\mathrm{d}}} $ denotes the price change of that stock on day d; $ \mathrm{Pric}{\mathrm{e}}_{\mathrm{Clos}{\mathrm{e}}_{\mathrm{d}}} $ and $ \mathrm{Pric}{\mathrm{e}}_{\mathrm{Ope}{\mathrm{n}}_{\mathrm{d}}} $ are that stock’s closing and opening prices on day d, respectively. In addition, each company’s percentage price change was computed and stored for each day using the following formula:

$ \mathrm{C}{\mathrm{o}}_{\mathrm{PM}{\mathrm{o}}_{\mathrm{d}}}=\frac{\mathrm{Pric}{\mathrm{e}}_{\mathrm{Clos}{\mathrm{e}}_{\mathrm{d}}}-\mathrm{Pric}{\mathrm{e}}_{\mathrm{Ope}{\mathrm{n}}_{\mathrm{d}}}}{\mathrm{Pric}{\mathrm{e}}_{\mathrm{Ope}{\mathrm{n}}_{\mathrm{d}}}}\times 100 $

where $ \mathrm{C}{\mathrm{o}}_{\mathrm{PM}{\mathrm{o}}_{\mathrm{d}}} $ signifies the percentage change in the price of that stock on day d. These values were calculated for each listed company and stored in one array row per day.
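Under the reconstruction above (daily change as close minus open, and the percentage change taken relative to the open), the daily components are straightforward to compute; a minimal pandas sketch follows, in which the column names are illustrative rather than the actual TseClient export schema:

```python
import pandas as pd

# One row per trading day with OPEN and CLOSE columns
# (column names here are assumptions, not the real export format).
df = pd.DataFrame({"OPEN": [100.0, 104.0, 101.0],
                   "CLOSE": [104.0, 101.0, 103.0]})

df["change"] = df["CLOSE"] - df["OPEN"]                           # Co_Mo_d
df["pct_change"] = (df["CLOSE"] - df["OPEN"]) / df["OPEN"] * 100  # Co_PMo_d
```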
Based on the stock values of active companies in the cement sector, the percentage return and the deviation from the average price change are calculated and stored for five time windows, yielding two components per window. These components, denoted $ \mathrm{C}{\mathrm{o}}_{\mathrm{Recor}{\mathrm{d}}_{\mathrm{range}}} $ , represent the percentage price change and the deviation from the average price change within the historical range (a sketch follows the list below). The time windows consist of
- Weekly (5 working days)
- Monthly (21 working days)
- Three-month (63 working days)
- Six-month (126 working days)
- One-year (252 working days)
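Equation (4) itself is not reproduced in this text, so the aggregation below is an assumption for illustration: the window return is taken as the compounded daily percentage change, and the deviation as the standard deviation of the daily percentage changes within the window.

```python
import numpy as np

WINDOWS = {"weekly": 5, "monthly": 21, "quarterly": 63,
           "six_month": 126, "one_year": 252}

def window_components(pct_changes):
    """pct_changes: 1-D array of daily percentage price changes for one
    company. Returns {window: (return %, deviation)} over the most recent
    `days` trading days of each horizon (aggregation rules assumed)."""
    out = {}
    for name, days in WINDOWS.items():
        w = np.asarray(pct_changes)[-days:]
        total_return = (np.prod(1 + w / 100) - 1) * 100  # compounded return
        deviation = w.std()                              # spread of daily changes
        out[name] = (total_return, deviation)
    return out
```

Each company thus contributes one (return, deviation) pair per window, which is the two-dimensional feature vector fed to the clustering algorithms in Section 5.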
Following the preprocessing of the data, information from 45 samples (cement companies) will be fed into the researched algorithms. In the subsequent section, the evaluation metrics for this data will be introduced, along with explanations of how each is calculated.
4. Evaluation criteria
The main idea in assessing clustering is to compare the intra-cluster (within-cluster) distance to the intercluster (cluster-to-cluster) distance to determine how well clusters are separated. A good clustering should have a small intra-cluster distance and a large intercluster distance. For evaluating clustering, the following metrics are introduced:
4.1. Silhouette coefficient
As one of the most widely used clustering evaluation metrics, the silhouette coefficient (S) combines intra-cluster and intercluster distances into a score ranging from −1 to 1. A value of 1 signifies an excellent clustering result, where intercluster distances are much larger than intra-cluster ones. Conversely, a value close to −1 indicates a completely wrong assignment of clusters, where intercluster distances are not comparable to intra-cluster distances (Rousseeuw, Reference Rousseeuw1987). Regarding the intra-cluster distance, for each data point i inside cluster $ {C}_{\mathrm{I}} $ , a(i) is defined as the average distance between i and the other data points within $ {C}_{\mathrm{I}} $ :

$ a(i)=\frac{1}{\left|{C}_{\mathrm{I}}\right|-1}\sum_{j\in {C}_{\mathrm{I}},\;j\ne i}d\left(i,j\right) $

where $ \left|{C}_{\mathrm{I}}\right| $ is the number of points belonging to cluster $ {C}_{\mathrm{I}} $ and $ d\left(i,j\right) $ is the distance between data points i and j in cluster $ {C}_{\mathrm{I}} $ .
Thus, for any given point i, a small score a(i) indicates a good clustering assignment for point i because it is close to the points of the same cluster. In contrast, a large a(i) score indicates bad clustering for point i because it is far from its cluster points.
For the intercluster distance, for each data point i inside cluster $ {C}_{\mathrm{I}} $ , b(i) is defined as the smallest average distance from i to the points of any other cluster of which i is not a member; in other words, b(i) is the average distance between i and all points of its nearest neighboring cluster:

$ b(i)=\underset{K\ne \mathrm{I}}{\min}\frac{1}{\left|{C}_K\right|}\sum_{j\in {C}_K}d\left(i,j\right) $

After obtaining the average intra-cluster and intercluster distances for each data point in the dataset, the silhouette score is defined as follows:

$ s(i)=\frac{b(i)-a(i)}{\max \left\{a(i),b(i)\right\}} $
The silhouette score is set to zero in the rare situation where $ \left|{C}_{\mathrm{I}}\right|=1 $ (a cluster containing only the single data point i). From the formula above, the score is bounded between −1 and 1, with a larger score indicating better separation between clusters. One of the most significant advantages of the silhouette score is its straightforward interpretation and clear bounds. However, its major drawback is its computational cost: on relatively large datasets, its long execution time makes it less practical in real-world applications.
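In practice, the silhouette score is available directly in scikit-learn; a minimal usage sketch on synthetic data follows (the $ \mathcal{O}\left({n}^2\right) $ pairwise distances behind it are what make it slow on large datasets):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # in [-1, 1]; higher is better
```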
4.2. The Calinski–Harabasz index
The Calinski–Harabasz index (CH), also known as the variance ratio criterion, measures the ratio of the sum of squared distances between clusters to the sum of squared distances within clusters, each corrected by its degrees of freedom. Here, the within-cluster sum is based on the distances of the data points in a cluster to its centroid, and the between-cluster sum is based on the distances between the cluster centroids and the global centroid (Calinski and Harabasz, Reference Calinski and Harabasz1974). The CH for K clusters in the dataset D is defined as follows:

$ \mathrm{CH}=\frac{\sum_{k=1}^K{n}_k{\left\Vert {c}_k-c\right\Vert}^2/\left(K-1\right)}{\sum_{k=1}^K\sum_{i\in {C}_k}{\left\Vert {d}_i-{c}_k\right\Vert}^2/\left(N-K\right)} $
Here $ {d}_i $ represents the feature vector of data point i, $ {n}_k $ denotes the size of cluster k, $ {c}_k $ is the centroid of cluster k, c is the global centroid of the entire dataset, and N is the total number of data points. The numerator is a weighted sum (by cluster size $ {n}_k $ ) of the squared distances from each cluster centroid to the global centroid, divided by its degrees of freedom; the denominator is the sum of squared distances from each data point to its cluster centroid, divided by its degrees of freedom. The degrees of freedom place the two parts on a unified scale. A higher value indicates better separation between clusters, and, unlike the silhouette score, the metric has no upper bound.
4.3. The Davies–Bouldin index
The Davies–Bouldin index (DB) is similar to the CH index but, contrary to CH, places the intra-cluster distance in the numerator of its ratio. The computation of the DB uses a similarity score measuring the similarity between two clusters, defined as follows:

$ {R}_{ij}=\frac{S_i+{S}_j}{M_{ij}} $
where $ {R}_{ij} $ is the similarity score, $ {S}_i $ and $ {S}_j $ are the average distances from points to the centroid within clusters i and j, respectively, and $ {M}_{ij} $ is the distance between the centroids of clusters i and j. A lower similarity score indicates better cluster separation, as a small numerator implies small intra-cluster distances and a large denominator implies a large intercluster distance. The DB index is the average, over all clusters, of each cluster’s similarity score with its closest neighboring cluster (Davies and Bouldin, Reference Davies and Bouldin1979):

$ \mathrm{DB}=\frac{1}{K}\sum_{i=1}^K{D}_i,\kern1em {D}_i=\underset{j\ne i}{\max }{R}_{ij} $
In this equation, $ {D}_i $ is the worst (largest) similarity score of cluster i against all other clusters, and the final DB index is the average of the $ {D}_i $ values over the K clusters.
A smaller DB index signifies better cluster separation. However, this metric shares a weakness with the CH index: it does not handle clusters of arbitrary shape well (such as those produced by density-based clustering). Nevertheless, the CH and DB indices are much quicker to compute than the silhouette score.
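Both indices are likewise available in scikit-learn; a minimal sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print(calinski_harabasz_score(X, labels))  # higher is better; no upper bound
print(davies_bouldin_score(X, labels))     # lower is better; 0 is the ideal
```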
4.4. The RI and adjusted Rand index
The RI and adjusted Rand index (ARI) are metrics commonly used to compare clustering results against external criteria. Their calculation is somewhat more involved than the previous metrics, as it counts pairwise combinations. For N cases, the total number of unordered pairs, denoted C(N, 2), is calculated as follows:

$ C\left(N,2\right)=\frac{N\left(N-1\right)}{2} $
Please note that both C(1,2) = 0 and C(0,2) = 0.
In the contingency matrix, $ {V}_{ij} $ represents the count of elements present in the intersection between class partition $ {\mathrm{A}}_i $ from partition A and class partition $ {\mathrm{B}}_j $ from partition B.
When $ {V}_{ij}>0 $ , it indicates overlap between partitions. The sum of each row $ {S}_{i\ast } $ equals the count of elements in the class partition $ {\mathrm{A}}_i $ , and the sum of each column $ {S}_{\ast j} $ equals the count of elements in the class partition $ {\mathrm{B}}_j $ . The total sum $ {S}_{mn} $ equals the count of elements in set S.
The calculation of the RI and the ARI can be expressed in four values x, y, z, and w. In the standard pair-counting formulation, these are

$ x=\sum_{i,j}C\left({V}_{ij},2\right),\kern1em y=\sum_iC\left({S}_{i\ast },2\right)-x,\kern1em z=\sum_jC\left({S}_{\ast j},2\right)-x,\kern1em w=C\left({S}_{mn},2\right)-x-y-z, $

that is, the numbers of pairs grouped together in both partitions, together only in A, together only in B, and apart in both, with $ \mathrm{RI}=\left(x+w\right)/C\left({S}_{mn},2\right) $ .
Since the true category of each company is unknown in this clustering task, we use only the first three criteria of this section, that is, the silhouette, Calinski–Harabasz, and Davies–Bouldin indices, where better clustering corresponds to maximizing the first two and minimizing the third. The next section presents the clustering method and the results of each algorithm (Rand, Reference Rand1971; Hubert and Arabie, Reference Hubert and Arabie1985).
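Although these indices are not used in the experiments below, a small sketch shows the pair-counting quantities under the standard formulation above, cross-checked against scikit-learn’s ARI:

```python
import numpy as np
from scipy.special import comb
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

A = np.array([0, 0, 1, 1, 2, 2])   # partition A (e.g., external labels)
B = np.array([0, 0, 1, 2, 2, 2])   # partition B (e.g., clustering result)

V = contingency_matrix(A, B)                 # V_ij of the text
x = comb(V, 2).sum()                         # together in both partitions
y = comb(V.sum(axis=1), 2).sum() - x         # together only in A
z = comb(V.sum(axis=0), 2).sum() - x         # together only in B
w = comb(len(A), 2) - x - y - z              # apart in both
print("RI  =", (x + w) / comb(len(A), 2))
print("ARI =", adjusted_rand_score(A, B))
```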
5. Experimental results and comparison with other algorithms
In this section, the stock values of listed companies active in the cement sector are clustered and analyzed over short-term and long-term intervals for two components: the percentage return and the deviation from the average price change. We use the K-means, GOPC, IPC, and FDPC algorithms, prominent algorithms in data clustering, as baselines for the FPC algorithm; the clustering results of FPC and the compared algorithms are then reviewed against the clustering evaluation criteria. Each listed company is described by its percentage price change and its deviation from the average price change, so the clustering results can be displayed in two dimensions.
5.1. Review of listed companies in the short term
This section focuses on clustering cement company stocks over short-term windows, specifically weekly and monthly. To initiate the clustering process, we first need to determine the number of clusters (K). Since we lack prior knowledge of this value, we run the K-means algorithm with the number of clusters ranging from 2 to 20 and calculate the silhouette metric for each case; the optimal cluster count is the one with the highest silhouette score. Given the stochastic nature of K-means, this process is executed 30 times, and the most prevalent value among the outcomes serves as the cluster count (K) for further analysis.
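A sketch of the described selection procedure follows (synthetic stand-in data; the real inputs are the 45 companies’ two-dimensional feature vectors):

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=45, centers=3, random_state=1)  # stand-in data

winners = []
for run in range(30):                      # 30 repetitions, per the text
    scores = {}
    for k in range(2, 21):                 # candidate cluster counts 2..20
        labels = KMeans(n_clusters=k, n_init=1,
                        random_state=run).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    winners.append(max(scores, key=scores.get))  # best K of this run

K = Counter(winners).most_common(1)[0][0]  # most prevalent optimum
print("chosen K:", K)
```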
5.1.1. Weekly interval
Diagram 1 shows the result obtained from running the algorithm on stock market data within the weekly time frame, as specified:
Diagram 1 shows that within this time frame, the optimal number of clusters for the cement-sector stock companies is two. Following this stage, we execute the FPC and rival algorithms with the obtained K, considering the weekly interval in equation (4). Figure 1 illustrates the clustering outcomes of the stock market companies within the weekly interval using the compared algorithms.
Based on the results of Figure 1, it can be said that the points in the bottom-left represent stock companies with higher negative profit percentages and lower price fluctuations in the weekly interval compared to other companies. Similarly, the points in the top-right indicate companies with higher positive profit percentages and more significant price fluctuations in the weekly interval than other companies. The evaluation results of three different clustering metrics are presented in Table 3.
The results show that the FPC method, similar to the K-means algorithm, provides better clustering in the first two metrics than other algorithms. In the third metric, the FDPC algorithm has the lowest value. However, given that its values in the first two metrics are considerably lower than those of other algorithms, its DB index value becomes unreliable. Consequently, both the K-means and FPC methods exhibit better performance in this experiment, indicating the creation of higher-quality clusters.
The results show that the cement-industry stock companies in the short-term weekly interval can be divided into two categories. Each category has a mean or cluster center; in other words, one company represents each cluster, as presented in Table 4.
Considering the results in Table 4 and Figure 1, focusing on the stocks in Cluster 2 could be beneficial for short-term and daily investments, because all companies within this cluster demonstrate positive returns suitable for short-term, specifically weekly, investment.
5.1.2. One-month interval
Diagram 2 shows the results obtained from running the algorithm on stock market data for the one-month interval, with the specified attributes:
Diagram 2 shows that seven clusters are the most suitable number for clustering the cement companies in this time frame. Following this stage, we execute the FPC algorithm and its competitors with the obtained K value, considering the one-month interval in equation (4). Figure 2 illustrates the clustering results of the stock market companies for the one-month interval using the compared algorithms.
Based on the results shown in Figure 2, the points in the bottom-left section represent stock market companies with higher negative profit percentages and fewer price variations within the one-month interval than other companies. Similarly, the points in the top-right section indicate companies with higher positive profit percentages and more significant price changes within the one-month interval than the rest. The evaluation results of three different clustering metrics are presented in Table 5.
The outcomes show that the FPC method demonstrates better clustering in the first two metrics compared to the other algorithms. Furthermore, GOPC, IPC, and FDPC algorithms exhibit the lowest values in the third metric. However, given their substantially lower values in the first two metrics than the other algorithms, their DB metric value becomes invalid. Following these assessments, the FPC algorithm presents the most favorable outcome. Therefore, in this experiment, the FPC method generates higher-quality clusters.
The results show that the cement-related stock market companies within the short-term one-month interval can be classified into seven categories. Each category possesses an average or cluster center; in other words, one company represents each cluster, as displayed in Table 6.
Based on the findings from Table 6 and Figure 2, it can be observed that focusing on stocks belonging to clusters 1, 4, and 5 would be advisable for short-term investments within the one-month interval. This recommendation stems from the fact that in these clusters, all stock market companies demonstrate positive returns and are suitable for short-term investments within one month. The companies falling into clusters 3 and 6 maintain a balanced outcome, signifying relatively moderate profits and customary fluctuations within the one-month interval, positioning them in an intermediate group. Finally, companies placed in Clusters 2 and 7 are not recommended for one-month investment due to their specific characteristics.
5.2. Long-term review of the listed companies
In this section, we will focus on clustering the stocks of cement companies in the long term, spanning over three-month, six-month, and one-year time frames. To conduct this analysis, we must determine the number of clusters (K) for the clustering process. Since the optimal number of clusters is unknown, we will employ the K-means algorithm with varying clusters from 2 to 20, calculating the silhouette score for each case. The higher silhouette score will determine the optimal number of clusters. As the K-means algorithm possesses a random structure, we will repeat this process 30 times, considering the predominant value among the generated scenarios as the reference for determining the number of clusters (K) for further analysis.
5.2.1. Three-month interval
Diagram 3 shows the result of running the algorithm on the stock market data in an interval of three months with the mentioned specifications.
Diagram 3 shows that in this time interval, four clusters is the best number for clustering the cement companies. After this step, we run the FPC and competing algorithms with the obtained K, considering the three-month interval in equation (4). Figure 3 shows the results of clustering the stock companies over three months with the compared algorithms.
According to the results of Figure 3, the points in the lower left represent the listed companies with a higher negative profit percentage and smaller price changes in the three-month interval than other companies; similarly, the points in the upper right represent the companies with a higher positive profit percentage and larger price changes in the three-month interval. The evaluation results of the three clustering criteria are shown in Table 7.
As the results show, the FPC method achieves better clustering in the first two criteria than the other algorithms. In the third criterion, the IPC and GOPC algorithms have the lowest value; however, given that their values in the first two criteria are considerably lower than those of the other algorithms, it can be concluded that the FPC method creates higher-quality clusters in this test.
Analyzing the results, we conclude that the listed cement-industry companies can be divided into four categories over the three-month interval. Each category has a mean or cluster center; in other words, one company represents each cluster, as shown in Table 8.
Given the results in Table 8 and Figure 3, investments over a three-month horizon could focus on the shares of the companies in Cluster 3 because, in this cluster, all the listed companies have positive returns and are suitable for quarterly investment. The listed companies in Clusters 1 and 2 have balanced results, with relative profitability and conventional changes in the three-month interval, and fall into the middle group. Finally, the companies in Cluster 4 are not recommended for quarterly investment.
5.2.2. Six-month interval
Diagram 4 shows the result of running the algorithm on the stock market data in a six-month interval with the mentioned specifications.
Diagram 4 shows that in this time interval, three clusters is the best number for clustering the listed cement companies. After this step, we run the FPC algorithm and the competing algorithms with the obtained K, considering the 126-day interval in equation (4). Figure 4 shows the results of clustering the stock companies in the six-month interval with the compared algorithms.
According to the results of Figure 4, the points in the lower left represent the listed companies with a higher negative profit percentage and smaller price changes in the six-month interval than other companies. Similarly, the points in the upper right represent companies with a higher positive profit percentage and larger price changes in the six-month interval. The evaluation results of the three clustering criteria are given in Table 9.
As can be seen from the results of this section, the FPC and K-means methods achieve better clustering in all three criteria than other algorithms. After analyzing the results of this section, it can be concluded that in this experiment, FPC and K-means create better quality clusters.
Analyzing the results, we conclude that the listed cement-industry companies can be divided into three categories over the six-month interval. Each category has a mean or cluster center; in other words, one company represents each cluster, as shown in Table 10.
According to the results of Table 10 and Figure 4, a long-term investment over six months could focus on the shares of the companies in Cluster 1 because, in this cluster, all stock companies have positive returns and are suitable for six-month investment. The stock exchange companies in Cluster 2 have balanced results, with relative profitability and conventional changes in the six-month interval, and fall into the middle group. Finally, the companies in Cluster 3 are not recommended for six-month investment.
5.2.3. One-year interval
Diagram 5 shows the result of running the algorithm on the stock market data in a one-year interval with the mentioned specifications.
Diagram 5 shows that in this time interval, three clusters is the best number for clustering the listed cement companies. After this step, we run the FPC algorithm and the competing algorithms with the obtained K, considering the one-year interval in equation (4). Figure 5 shows the results of clustering the listed companies in a one-year interval with the compared algorithms.
According to the results of Figure 5, the points in the lower left represent the listed companies with a higher negative profit percentage and smaller price changes in the one-year interval than other companies; similarly, the points in the upper right represent companies with a higher positive profit percentage and larger price changes in the one-year interval. The evaluation results of the three clustering criteria are shown in Table 11.
As the results of this section show, the FPC, FDPC, and K-means methods achieve better clustering in all three criteria than the IPC and GOPC algorithms, so in this experiment they create better quality clusters. Based on the analysis, cement-related companies in the stock market can be categorized over the long-term one-year interval into three distinct clusters, each with an average or central point represented by a company within that cluster, as shown in Table 12.
From the findings in Table 12 and Figure 5, long-term investments within a one-year time frame could focus on the stocks in Cluster 2, which includes companies exhibiting positive returns suitable for longer-term strategies. Companies in Cluster 1 exhibit balanced performance, with relatively moderate profitability and standard fluctuations within the one-year interval, positioning them in the middle group. Finally, companies in Cluster 3 are not recommended for one-year investments.
6. Discussion
Table 13 shows the listed companies and their cluster numbers for each interval.
According to Table 13, the companies in each cluster can be checked for the time intervals evaluated in this article.
7. Conclusion
This study clustered and evaluated listed companies active in the cement sector. In this regard, 45 cement companies were evaluated on capital return and deviation from the average price change over short-term (weekly and one-month) and long-term (three-month, six-month, and one-year) intervals. The K-means, IPC, GOPC, and FDPC clustering algorithms were used to cluster these data, and their results were compared with the FPC algorithm. The clustering results and evaluations indicated that the FPC algorithm produced better quality clusters in all cases and showed acceptable performance across the different clustering experiments. In addition, by examining the profitability and price changes of the listed companies, the representative company of each cluster was identified in each time interval.
Tables 4, 6, 8, 10, and 12 identify which company represents each cluster among all the cement-sector companies in the stock market, and the companies suitable and unsuitable for investment in each interval were introduced. According to Table 13, these cluster assignments can be reviewed per company and considered in short-term and long-term investment decisions.
Data availability statement
The data for this research were obtained from the Tehran Stock Exchange Technology Management Company website (https://www.khanehsiman.com, available August 24, 2024).
Author contribution
All of the authors have participated in all parts of the manuscript.
Competing interest
The authors declare no conflicts of interest to report regarding the present study.
Ethical approval
This article does not contain any studies with human participants or animals performed by authors.