Keeping Up With... Big Data

This edition of Keeping Up With... was written by Mark Bieraugel.

Mark Bieraugel is Business Librarian at California Polytechnic State University. Mark can be reached at mbieraug@calpoly.edu.

Introduction

Big data is here, and it is coming to your world faster than you expect. Business, the tech world, and higher education are abuzz with discussions and predictions about big data. It is important to understand big data now because it affects libraries both directly and tangentially: directly, because your library can use big data tools to analyze its own large data sets; and tangentially, because the faculty at your school will increasingly incorporate big data into their research.

What is Big Data?

Big data is characterized by three Vs: Volume, Velocity, and Variety. The first V, volume, is the easiest to understand. Big data differs from regular data in that the data sets are huge. How huge? That depends on the industry or discipline, but big data is loosely defined as data that cannot be stored or analyzed by conventional hardware and software. Traditional software can handle kilobyte- and megabyte-sized data sets, while big data tools can handle terabyte- and petabyte-sized data sets. The second V, velocity, covers the speed at which data is created. Think of how quickly someone can compose a single tweet on Twitter or post to Facebook, or how thousands of remote sensors constantly measure and report on changing seawater temperatures. The third V, variety, makes big data sets more challenging to organize and analyze. Traditionally, the type of data collected by businesses and researchers was strictly controlled and structured, such as data entered into a spreadsheet with specific rows and columns, nice and clean. Big data sets can contain unstructured data such as email messages, photographs, postings on internet forums, and even phone transcripts.
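To make the jump in volume concrete, the short Python sketch below compares the units named above and estimates how long it would take simply to read a data set once from start to finish. It is only a back-of-the-envelope illustration: the decimal unit sizes are standard, but the read speed of 100 megabytes per second is an assumed, hypothetical figure for a single conventional disk, and the function names are invented for this example.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {
    "kilobyte": 10**3,
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
}

# ASSUMPTION: a hypothetical sustained read speed for one conventional disk.
ASSUMED_BYTES_PER_SECOND = 100 * UNITS["megabyte"]


def times_larger(larger_unit, smaller_unit):
    """Return how many of the smaller unit fit into one of the larger unit."""
    return UNITS[larger_unit] // UNITS[smaller_unit]


def seconds_to_read(size_in_bytes, bytes_per_second=ASSUMED_BYTES_PER_SECOND):
    """Estimate the time to read a data set once at a steady speed."""
    return size_in_bytes / bytes_per_second


if __name__ == "__main__":
    print("Megabytes in one terabyte:", times_larger("terabyte", "megabyte"))
    for unit in ("megabyte", "terabyte", "petabyte"):
        hours = seconds_to_read(UNITS[unit]) / 3600
        print(f"One {unit}: about {hours:,.4f} hours to read once")
```

Under that assumed speed, a single pass over one terabyte takes hours and a single pass over one petabyte takes months, which is one way to see why data at this scale is spread across many machines rather than handled by one.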

Real Thing or Vaporware: Why Big Data Now?

Managing and analyzing big data sets was once the exclusive realm of the trinity of academia, big business, and national governments. What is new is that the hardware and software for analyzing big data are cheaper and hence more available to smaller businesses, more of academia, and local governments. Also new is the ability to analyze big data in real time and to make predictions based on it. Early users of big data were born-digital firms such as Facebook, LinkedIn, Google, and Twitter, which relied on analyzing large data sets to orchestrate their success. A number of factors have converged to make it possible to corral and effectively mine massive data sets. These factors include lower costs of commodity servers to house the data, the release of open source software tools to manage distributed computing, the creation of massive data sets, and the need for businesses and other entities to wring value out of the data they collect.

What Librarians Need to Know About Big Data

Because of its prevalence and potential impacts, librarians need to know the basics of big data and how it affects academic research. Business librarians need to know how companies leverage big data, how such data mining provides a competitive advantage, and how students might need to grapple with big data sets in future employment. Science librarians need to know how big data differs from other scientific data and the impact of emerging software and hardware used for its analysis. Humanities and Social Science librarians should know that big data is becoming more commonplace in their disciplines as well, and is no longer restricted to corpus linguistics. Librarians in all disciplines, in order to facilitate the research process, will need to be aware of how big data is used and where it can be found.

Big Data Curation

Librarians also need to embrace a role in making big data sets more useful, visible, and accessible by creating taxonomies, designing metadata schemes, and systematizing retrieval methods. Digital archivists, data curators, and other types of librarians are also asked to advise their faculty on the storage and accessibility of big data sets. Penn State’s Mike Furlough notes that we as librarians know the value of traditional information sources, but what is the value of less finished data, so-called ‘raw data’? We don’t really know the value of raw data, but the key to understanding it is that, with new and powerful analytics, including information visualization tools, researchers can look at data in new ways and mine it for information beyond what the data was originally collected for.

Next Steps for Academic Libraries

Library administration and management should examine what types of big data sets their library could be gathering and analyzing using big data tools. Does your library have an opportunity to measure something new, some massive data set which previously was out of your reach because of software and hardware constraints? From the side of big data curation, could your library, as part of storing your faculty’s scholarly research and making it accessible, also store and mount your faculty’s raw research data for others to use?

Your library could be gathering big data for analysis to help make data-driven decisions. What types of big data could you use to make better decisions about collection development, updating public spaces, or tracking use of library materials through your learning management system? Or you could be the thought leader on big data curation at your institution by providing guidance on storing big data sets and making them accessible. Now is the time for your library to understand the issues and opportunities big data offers to researchers, administration, and the librarians at your institution.

Learn More About Big Data

Recommended Reading
“Data, Data, Everywhere.” The Economist. February 25, 2010. http://www.economist.com/node/15557443 – Reports on the shift from data scarcity to overabundance and the benefits and headaches that result.

Graham, Mark. “Big Data and the End of Theory.” The Guardian. March 9, 2012. http://www.guardian.co.uk/news/datablog/2012/mar/09/big-data-theory - A measured response to big data hype.

Press, Gil. “A Very Short History of Big Data.” Forbes. May 9, 2013. http://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-... - Seventy years of big data history.

Big Data and the Academy
Bell, Steven. “Promise and Problems of Big Data.” Library Journal. March 13, 2013. http://lj.libraryjournal.com/2013/03/opinion/steven-bell/promise-and-pro...  - Cautionary article on big data ‘solutionism.’

Parry, Marc. “Big Data on Campus.” The New York Times. July 18, 2012. http://www.nytimes.com/2012/07/22/education/edlife/colleges-awakening-to... - How colleges are using big data to help students choose classes, to improve retention, and to counsel those in need.

Schwartz, Meredith. “What Governmental Big Data May Mean For Libraries.” Library Journal. May 30, 2013. http://lj.libraryjournal.com/2013/05/oa/what-governmental-big-data-may-m... - Government open data initiatives and how they affect libraries and data collection and retention.


Case Studies
Howard, Alex. “Predictive Data Analytics is Saving Lives and Taxpayer Dollars in New York City.” O’Reilly RADAR. June 26, 2012. http://strata.oreilly.com/2012/06/predictive-data-analytics-big-data-nyc... - How big data is helping city government be more effective and efficient.

Madrigal, Alexis. “The Perfect Milk Machine: How Big Data Transformed the Dairy Industry.” The Atlantic Monthly. May 1, 2012. http://www.theatlantic.com/technology/archive/2012/05/the-perfect-milk-m... - The impact of big data on cattle breeding.

Scherer, Michael. “How Obama’s Data Crunchers Helped Him Win.” Time. November 8, 2012. http://www.cnn.com/2012/11/07/tech/web/obama-campaign-tech-team - Covers how big data analytics helped Obama win the 2012 presidential election.


Privacy and Criticism
boyd, danah and Kate Crawford. “Critical Questions for Big Data.” Information, Communication & Society. May 10, 2012. http://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878#preview – Microsoft researchers ask provocative questions about the use of big data.

Crawford, Kate. “Think Again: Big Data.” Foreign Policy. May 9, 2013. http://www.foreignpolicy.com/articles/2013/05/09/think_again_big_data - Discusses the limitations and potential downsides of data driven decision making using big data sets.

Croll, Alistair. “Big Data is Our Generation’s Civil Rights Issue and We Don’t Know It.” O’Reilly RADAR. August 2, 2012. http://radar.oreilly.com/2012/08/big-data-is-our-generations-civil-right... - Examines how web ‘personalization’ might be another form of redlining or racial profiling.

Duhigg, Charles. “How Companies Learn Your Secrets.” New York Times. February 16, 2012. http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewant... - How Target uses big data to determine when its customers are pregnant.

Tutorials
“AMP Camp Big Data Bootcamp.”  AMPLab. Accessed May 18, 2013. http://ampcamp.berkeley.edu/big-data-mini-course-home/  - Hands-on mini course on big data from Berkeley’s AMPLab. Requires an Amazon EC2 account, and some technical expertise.

“Big Data Tutorial: Everything You Need to Know.” SearchStorage. Accessed May 20, 2013. http://searchstorage.techtarget.com/guides/Big-data-tutorial-Everything-...  -  From the basics to a deeper dive into more technical issues of big data.

Tariq, Mohammed. “Hadoop Toolbox: When to Use What.” SmartData Collective. April 27, 2013.  http://smartdatacollective.com/mtariq/120791/hadoop-toolbox-when-use-wha... – Reports on the set of software tools used for big data.

Sandboxes
A big data “sandbox” is a free way to try out Hadoop and other big data tools. Sandboxes require downloading specific software.

“Cloudera QuickStart VM.” Cloudera. Accessed May 22, 2013. https://ccp.cloudera.com/display/SUPPORT/Cloudera+QuickStart+VM - This sandbox requires a 64-bit host OS and 4 GB of total RAM.

“Get Started with Hadoop & Hortonworks Data Platform.” Hortonworks. Accessed March 15, 2013. http://hortonworks.com/get-started/ - This sandbox requires a 64-bit host OS and 4 GB of total RAM.