Book contents
- Frontmatter
- Contents
- List of Contributors
- 1 Data-Intensive Computing: A Challenge for the 21st Century
- 2 Anatomy of Data-Intensive Computing Applications
- 3 Hardware Architectures for Data-Intensive Computing Problems: A Case Study for String Matching
- 4 Data Management Architectures
- 5 Large-Scale Data Management Techniques in Cloud Computing Platforms
- 6 Dimension Reduction for Streaming Data
- 7 Binary Classification with Support Vector Machines
- 8 Beyond MapReduce: New Requirements for Scalable Data Processing
- 9 Let the Data Do the Talking: Hypothesis Discovery from Large-Scale Data Sets in Real Time
- 10 Data-Intensive Visual Analysis for Cyber-Security
- Index
- References
8 - Beyond MapReduce: New Requirements for Scalable Data Processing
Published online by Cambridge University Press: 05 December 2012
Summary
Introduction and Background
The MapReduce programming model has had a transformative impact on data-intensive computing, enabling a single programmer to harness hundreds or thousands of computers for a single task and to get up and running in a matter of hours. When processing spans thousands of computers, a different set of design considerations dominates: I/O scalability, fault tolerance, and flexibility rather than absolute performance. MapReduce, and its open-source implementation Hadoop, are optimized for these considerations and have become very successful as a result.
It is difficult to quantify the popularity of the MapReduce framework directly, but one indication of its uptake is search-term frequency. Figure 8.1 illustrates the search popularity of the terms “mapreduce” and “hadoop” over the period 2006 to 2012. We see a spike in popularity for the term “mapreduce” in late 2007, but more or less constant popularity since. For the term “hadoop,” however, we see a steady increase to about twelve times that of “mapreduce.”
This growth in search traffic is corroborated by the number of downloads, successful startups [12, 19, 47], related projects [41, 53, 57], and attention from the research community [15, 18, 62, 63, 72, 78]. Together, these data suggest a significant increase in interest in both MapReduce and Hadoop.
The MapReduce framework provides a simple programming model for expressing loosely coupled parallel programs by providing two serial functions, Map and Reduce. The Map function processes a block of input producing a sequence of (key, value) pairs, while the Reduce function processes a set of values associated with a single key.
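The Map and Reduce functions described above can be illustrated with the classic word-count example. The following is a minimal single-process sketch, not a distributed implementation: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative, and the shuffle phase that a real framework performs across machines is simulated here with an in-memory dictionary.

```python
from collections import defaultdict

def map_fn(block):
    """Map: process one block of input, emitting (key, value) pairs."""
    for word in block.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """Reduce: process all values associated with a single key."""
    return (key, sum(values))

def run_mapreduce(blocks):
    """Illustrative driver: runs map, shuffles pairs by key, then runs reduce."""
    groups = defaultdict(list)          # shuffle phase: group values by key
    for block in blocks:
        for key, value in map_fn(block):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = run_mapreduce(["the quick brown fox", "the lazy dog"])
# counts["the"] == 2
```

Because `map_fn` sees only one input block at a time and `reduce_fn` sees only one key's values, a framework can run many instances of each in parallel on separate machines; this independence of the serial functions is what makes the model loosely coupled.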
- Data-Intensive Computing: Architectures, Algorithms, and Applications, pp. 180–234. Publisher: Cambridge University Press. Print publication year: 2012.