Published online by Cambridge University Press: 16 December 2024
Nutrition research relies on food databases, which are used extensively in dietary surveys, clinical practice, research, and policy development (1). Online data volume is expected to reach 180 zettabytes by 2025, driven by a proliferation of internet-connected devices, the growth of social media platforms, and the digital transformation of industries (2). Web scraping, a method of extracting data from websites, has previously been used in Ireland to evaluate online retailer information as a potential source for monitoring food reformulation efforts in the Irish retail market (3). This study aims to outline a process for web scraping online supermarket websites, and to evaluate its use, in order to increase the data available to researchers.
An online supermarket website was selected to trial the new process, and Octoparse software (version 8) was downloaded. Twelve data fields of interest were identified: cost, lifestyle, net weight, directions for use, storage instructions, nutrition information, front-of-pack information, legal name, brand name, manufacturer, ingredients, and allergy advice. A web scraping process was defined in four main steps: 1) collection of category-level URLs, 2) collection of product-level URLs, 3) collection of data at product level within the defined fields, and 4) data cleaning and restructuring. A workflow was created in Octoparse for steps 1 to 3, and step 4 was completed using Excel version 16.69.1.
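The study itself used Octoparse and Excel rather than code, but the four steps can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the HTML snippets, the `class="product"` links, the `data-field` attributes, and the helper names are hypothetical and do not describe the retailer's actual markup. Pages are held in memory so the example runs without network access; step 1 is represented by the category page already being in hand.

```python
from html.parser import HTMLParser

# Hypothetical pages standing in for a category page and its product pages.
CATEGORY_PAGE = (
    '<a class="product" href="/p/1">Oat bar</a>'
    '<a class="product" href="/p/2">Soup</a>'
)
PRODUCT_PAGES = {
    "/p/1": '<span data-field="brand name"> Acme </span>'
            '<span data-field="net weight">40g</span>',
    "/p/2": '<span data-field="brand name">Soupco</span>',
}


class LinkCollector(HTMLParser):
    """Step 2: collect product-level URLs from a category page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "product":
            self.links.append(attrs["href"])


class FieldCollector(HTMLParser):
    """Step 3: collect text inside elements carrying a data-field attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        self._current = dict(attrs).get("data-field")

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = self.fields.get(self._current, "") + data

    def handle_endtag(self, tag):
        self._current = None


def product_urls(category_html):
    parser = LinkCollector()
    parser.feed(category_html)
    return parser.links


def scrape_product(product_html):
    parser = FieldCollector()
    parser.feed(product_html)
    # Step 4: basic cleaning, here only collapsing stray whitespace.
    return {name: " ".join(text.split()) for name, text in parser.fields.items()}


rows = [scrape_product(PRODUCT_PAGES[url]) for url in product_urls(CATEGORY_PAGE)]
```

In this sketch each product becomes one dictionary keyed by field name, so products missing a field simply lack that key, which makes it straightforward to count how many products have data in the defined fields.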
Eighty-three category-level page links were generated and entered into Octoparse. Web scraping was completed on 3,095 product-level URLs. Data were successfully scraped for 1,450 products (47%), which had data within the 12 defined data fields. A new dataset was created for these 1,450 products, with data fields including nutrition information (energy, fat, of which saturates, carbohydrate, of which sugars, fibre, protein, and salt), cost per serving and per kg, lifestyle factors (e.g. whether a product was vegetarian or vegan), ingredient lists, and allergy advice. Of these, 637 products (44%) had vegetarian/vegan claims. Micronutrient-level data were limited.
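The reported percentages can be checked with a short calculation. This is only an arithmetic sketch; the helper name `share` is introduced here for illustration, and the 44% figure is assumed to be expressed relative to the 1,450 successfully scraped products.

```python
def share(part, whole):
    """Return part/whole as a percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# Products with data in the defined fields, out of all product URLs scraped.
scraped_share = share(1450, 3095)  # 47

# Products with vegetarian/vegan claims, out of the scraped products.
claim_share = share(637, 1450)  # 44
```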
The increased availability of online data presents an opportunity to develop new and more systematically updated datasets, and may increase the availability of information on branded products. Web scraping enables researchers to create new databases, and to update datasets systematically, with fewer resources. This study enhances the availability of data and may enable researchers to explore new avenues for understanding food environments. Future research should test the process on additional websites to increase coverage of the Irish retail market and of other regions, identify sources with more in-depth nutritional data, and evaluate use cases in mobile applications. Web scraping is a promising tool for advancing research in food science and nutrition, and for providing access to diverse datasets for research and innovation that stay current over time.