5 - Large-Scale Data Management Techniques in Cloud Computing Platforms

Published online by Cambridge University Press:  05 December 2012

Sherif Sakr
Affiliation:
National ICT Australia (NICTA), University of New South Wales
Anna Liu
Affiliation:
National ICT Australia (NICTA), University of New South Wales
Ian Gorton
Affiliation:
Pacific Northwest National Laboratory, Washington
Deborah K. Gracio
Affiliation:
Pacific Northwest National Laboratory, Washington

Summary

Introduction

Over the last two decades, the continuous increase in computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. In a speech given just a few weeks before he was lost at sea off the California coast in January 2007, Jim Gray, a database software pioneer and Microsoft researcher, called this shift a “fourth paradigm” [32]. The first three paradigms were experimental, theoretical, and, more recently, computational science. Gray argued that the only way to cope with this fourth paradigm is to develop a new generation of computing tools to manage, visualize, and analyze the data flood. Current computer architectures are increasingly imbalanced: the latency gap between multicore CPUs and mechanical hard disks grows every year, which makes the challenges of data-intensive computing harder to overcome [6]. There is therefore a crucial need for a systematic and generic approach to these problems, built on an architecture that can also scale into the foreseeable future. In response, Gray argued that the new trend should focus on cheaper clusters of computers to manage and process all this data rather than on building the biggest and fastest single computer. Figure 5.1 illustrates an example of the explosion in scientific data, which creates major challenges for cutting-edge scientific projects. For example, modern high-energy physics experiments, such as DZero, typically generate more than one terabyte of data per day.

Type: Chapter
Information: Data-Intensive Computing: Architectures, Algorithms, and Applications, pp. 85–123
Publisher: Cambridge University Press
Print publication year: 2012

References

1. Abadi, D. “Data Management in the Cloud: Limitations and Opportunities.” IEEE Data Eng. Bull. 32, no. 1 (2009): 3–12.
2. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Rasin, A., and Silberschatz, A. “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.” PVLDB 2, no. 1 (2009): 922–33.
3. Abouzeid, A., Bajda-Pawlikowski, K., Huang, J., Abadi, D., and Silberschatz, A. “HadoopDB in Action: Building Real World Applications.” In SIGMOD, 2010.
4. Armbrust, M., Fox, A., Griffith, R., Joseph, A., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., and Zaharia, M. “Above the Clouds: A Berkeley View of Cloud Computing.” Feb. 2009.
5. Cooper, B., Silberstein, A., Tam, E., Ramakrishnan, R., and Sears, R. “Benchmarking Cloud Serving Systems with YCSB.” In ACM SoCC, 2010.
6. Bell, G., Gray, J., and Szalay, A. “Petascale Computational Systems.” IEEE Computer 39, no. 1 (2006): 110–12.
7. Bernstein, P., Cseri, I., Dani, N., Ellis, N., Kalhan, A., Kakivaya, G., Lomet, D., Manne, R., Novik, L., and Talius, T. “Adapting Microsoft SQL Server for Cloud Computing.” In ICDE, pages 1255–1263, 2011.
8. Binnig, C., Kossmann, D., Kraska, T., and Loesing, S. “How is the Weather Tomorrow?: Towards a Benchmark for the Cloud.” In DBTest, 2009.
9. Brantner, M., Florescu, D., Graf, D., Kossmann, D., and Kraska, T. “Building a Database on S3.” In SIGMOD, pages 251–264, 2008.
10. Brewer, E. “Towards Robust Distributed Systems (abstract).” In PODC, page 7, 2000.
11. Bu, Y., Howe, B., Balazinska, M., and Ernst, M. “HaLoop: Efficient Iterative Data Processing on Large Clusters.” PVLDB 3, no. 1 (2010): 285–96.
12. Burrows, M. “The Chubby Lock Service for Loosely-Coupled Distributed Systems.” In OSDI, pages 335–350, 2006.
13. Cary, A., Sun, Z., Hristidis, V., and Rishe, N. “Experiences on Processing Spatial Data with MapReduce.” In SSDBM, pages 302–319, 2009.
14. Chandra, T. D., Griesemer, R., and Redstone, J. “Paxos Made Live: An Engineering Perspective.” In PODC, pages 398–407, 2007.
15. Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. “Bigtable: A Distributed Storage System for Structured Data.” ACM Trans. Comput. Syst. 26, no. 2 (2008).
16. Chen, R., Weng, X., He, B., and Yang, M. “Large Graph Processing in the Cloud.” In SIGMOD, pages 1123–1126, 2010.
17. Cooper, B., Baldeschwieler, E., Fonseca, R., Kistler, J., Narayan, P., Neerdaels, C., Negrin, T., Ramakrishnan, R., Silberstein, A., Srivastava, U., and Stata, R. “Building a Cloud for Yahoo!” IEEE Data Eng. Bull. 32, no. 1 (2009): 36–43.
18. Cooper, B., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen, H., Puz, N., Weaver, D., and Yerneni, R. “PNUTS: Yahoo!'s Hosted Data Serving Platform.” PVLDB 1, no. 2 (2008): 1277–88.
19. Das, S., Sismanis, Y., Beyer, K., Gemulla, R., Haas, P., and McPherson, J. “Ricardo: Integrating R and Hadoop.” In SIGMOD, pages 987–998, 2010.
20. Dean, J., and Ghemawat, S. “MapReduce: Simplified Data Processing on Large Clusters.” In OSDI, pages 137–150, 2004.
21. Dean, J., and Ghemawat, S. “MapReduce: Simplified Data Processing on Large Clusters.” Commun. ACM 51, no. 1 (2008): 107–13.
22. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. “Dynamo: Amazon's Highly Available Key-Value Store.” In SOSP, pages 205–220, 2007.
23. Deelman, E., Singh, G., Livny, M., Berriman, G., and Good, J. “The Cost of Doing Science on the Cloud: The Montage Example.” In SC, page 50, 2008.
24. Dittrich, J., Quiané-Ruiz, J., Jindal, A., Kargin, Y., Setty, V., and Schad, J. “Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing).” PVLDB 3, no. 1 (2010): 518–29.
25. Foster, I., and Kesselman, C. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1999.
26. Friedman, E., Pawlowski, P., and Cieslewicz, J. “SQL/MapReduce: A Practical Approach to Self-Describing, Polymorphic, and Parallelizable User-Defined Functions.” PVLDB 2, no. 2 (2009): 1402–13.
27. Gartner. “Gartner Top Ten Disruptive Technologies for 2008 to 2012.” Emerging Trends and Technologies Roadshow, 2008.
28. Gates, A., Natkovich, O., Chopra, S., Kamath, P., Narayanam, S., Olston, C., Reed, B., Srinivasan, S., and Srivastava, U. “Building a High-Level Dataflow System on Top of MapReduce: The Pig Experience.” PVLDB 2, no. 2 (2009): 1414–25.
29. Ghemawat, S., Gobioff, H., and Leung, S. “The Google File System.” In SOSP, pages 29–43, 2003.
30. Gilbert, S., and Lynch, N. “Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services.” SIGACT News 33, no. 2 (2002): 51–59.
31. Gonzalez, L., Merino, L., Caceres, J., and Lindner, M. “A Break in the Clouds: Towards a Cloud Definition.” Computer Communication Review 39, no. 1 (2009): 50–5.
32. Hey, T., Tansley, S., and Tolle, K., eds. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, October 2009.
33. Karger, D., Lehman, E., Leighton, F., Panigrahy, R., Levine, M., and Lewin, D. “Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web.” In STOC, pages 654–663, 1997.
34. Kossmann, D., Kraska, T., and Loesing, S. “An Evaluation of Alternative Architectures for Transaction Processing in the Cloud.” In SIGMOD, 2010.
35. Lakshman, A., and Malik, P. “Cassandra: Structured Storage System on a P2P Network.” In PODC, page 5, 2009.
36. Lu, W., Jackson, J., and Barga, R. “AzureBlast: A Case Study of Developing Science Applications on the Cloud.” In HPDC, pages 413–420, 2010.
37. Malewicz, G., Austern, M., Bik, A., Dehnert, J., Horn, I., Leiser, N., and Czajkowski, G. “Pregel: A System for Large-Scale Graph Processing.” In SIGMOD, pages 135–146, 2010.
38. Nykiel, T., Potamias, M., Mishra, C., Kollios, G., and Koudas, N. “MRShare: Sharing Across Multiple Queries in MapReduce.” PVLDB 3, no. 1 (2010): 494–505.
39. Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. “Pig Latin: A Not-So-Foreign Language for Data Processing.” In SIGMOD, pages 1099–1110, 2008.
40. Pavlo, A., Paulson, E., Rasin, A., Abadi, D., DeWitt, D., Madden, S., and Stonebraker, M. “A Comparison of Approaches to Large-Scale Data Analysis.” In SIGMOD, pages 165–178, 2009.
41. Stonebraker, M. “The Case for Shared Nothing.” IEEE Database Eng. Bull. 9, no. 1 (1986): 4–9.
42. Stonebraker, M., Abadi, D., DeWitt, D., Madden, S., Paulson, E., Pavlo, A., and Rasin, A. “MapReduce and Parallel DBMSs: Friends or Foes?” Commun. ACM 53, no. 1 (2010): 64–71.
43. Condie, T., Conway, N., Alvaro, P., Hellerstein, J., Elmeleegy, K., and Sears, R. “MapReduce Online.” In NSDI, 2010.
44. Tanenbaum, A., and Steen, M., eds. Distributed Systems: Principles and Paradigms. Prentice Hall, 2002.
45. Thusoo, A., Sarma, J., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., and Murthy, R. “Hive – A Warehousing Solution Over a Map-Reduce Framework.” PVLDB 2, no. 2 (2009): 1626–29.
46. Thusoo, A., Sarma, J., Jain, N., Shao, Z., Chakka, P., Zhang, N., Anthony, S., Liu, H., and Murthy, R. “Hive – A Petabyte Scale Data Warehouse Using Hadoop.” In ICDE, pages 996–1005, 2010.
47. Vogels, W. “Eventually Consistent.” Commun. ACM 52, no. 1 (2009): 40–44.
48. Wang, C., Wang, J., Lin, X., Wang, W., Wang, H., Li, H., Tian, W., Xu, J., and Li, R. “MapDupReducer: Detecting Near Duplicates Over Massive Datasets.” In SIGMOD, pages 1119–1122, 2010.
49. Xu, Y., Kostamaa, P., and Gao, L. “Integrating Hadoop and Parallel DBMS.” In SIGMOD, pages 969–974, 2010.
50. Yang, H., Dasdan, A., Hsiao, R., and Parker, D. “Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters.” In SIGMOD, pages 1029–1040, 2007.
