TER Volume 14, Number 1, June 2007: Review of Data Crunching

Technology Electronic Reviews
Volume 14, Number 1, June 2007


REVIEW OF: Greg Wilson (2005). Data Crunching: Solve Everyday Problems Using Java, Python, and More. Raleigh, NC: The Pragmatic Bookshelf. (ISBN: 0974514071). 176pp. $29.95

By Scott Rice

Data Crunching: Solve Everyday Problems Using Java, Python, and More is another in the Pragmatic Programmer’s series of books. Written by Greg Wilson, a Ph.D. in computer science and adjunct professor at the University of Toronto, this guide attempts to highlight, and give tools for conquering, the less appealing side of computer programming: data manipulation. The Skill Range guide on the back of the book advertises the text as being suitable for those who are about halfway between Beginner and Expert.

Chapter 1 is the introduction to the themes of the book and also talks about the programming tools and languages that will be addressed, which include Unix, Python, Java, XSLT, and SQL. Two examples of data crunching are presented in the chapter as illustrations of the need for data crunching skills and the difficulties involved. The first discusses converting lines of data describing molecules in a Protein Data Bank format into a format called VU3. The second involves grades from students submitted in non-standard formats.

Chapter 2 covers manipulating text strings using Python. The molecule description example is used again here to illustrate some of the techniques. Data dictionaries are explained and Unix shell commands are covered, with a handy list of popular commands provided.

Chapter 3 explains regular expressions and works through examples. This is one of the more useful chapters in the book, as it provides a very clear discussion of how regular expressions work and how to use them. Numerous real-life examples of using regular expressions are covered, including email addresses, Canadian postal codes, 24-hour times, and feet and inch conversions.
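To give a flavor of the kind of pattern this chapter deals with, here is a minimal Python sketch. It is not taken from the book: the function names are hypothetical, and the postal code pattern is a simplification that accepts any letter, whereas real Canadian postal codes exclude some letters.

```python
import re

# Simplified Canadian postal code: letter, digit, letter, optional space,
# digit, letter, digit. Assumption: any ASCII letter is accepted.
POSTAL_CODE = re.compile(r"[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]")

# 24-hour time from 00:00 through 23:59, with a required leading zero.
TIME_24H = re.compile(r"([01][0-9]|2[0-3]):[0-5][0-9]")


def is_postal_code(text):
    """Return True if the whole string looks like a postal code."""
    return POSTAL_CODE.fullmatch(text) is not None


def is_time_24h(text):
    """Return True if the whole string is a valid 24-hour time."""
    return TIME_24H.fullmatch(text) is not None
```

Using fullmatch rather than match means that trailing characters cause a rejection, which is usually what is wanted when validating a single field.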

XML is the topic of Chapter 4, which also includes a discussion of SAX, XPath, and XSLT. A brief history of XML is given, as well as a basic discussion of the formatting rules. Normalization and error handling are covered in the section on SAX, which is the Simple API for XML. The Document Object Model (DOM) is also covered in this chapter. Throughout the chapter, the author builds XML and XSLT files to show the proper methodology. This chapter was also very useful, although it is another example of the unstated theme of this book, which is trying to fit too much information into too small a space.
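For readers unfamiliar with SAX, the event-driven style can be sketched in a few lines using Python's standard xml.sax module. This is an illustration only, not the author's code; the TagCounter class and count_tags function are hypothetical names.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Count how many times each element name is opened."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # The parser calls this once for every opening tag it meets.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_tags(xml_bytes):
    """Parse XML given as bytes and return a dict of element counts."""
    handler = TagCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.counts
```

Calling count_tags(b"<molecule><atom/><atom/></molecule>") reports one molecule element and two atom elements. Because SAX hands over events as it reads, the document never has to be held in memory as a tree, which is the main contrast with DOM.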

Chapter 5 covers binary data and how to store and manipulate it. This seems like the least useful chapter, as I find it hard to imagine that intermediate programmers would be using binary data on a regular or even irregular basis.

Chapter 6 is about relational databases and includes a lot of information about SQL. In fact, the author covers just about every topic someone would need in order to use SQL to extract and manipulate information in databases. Queries, joins, nesting, negation, aggregation, views, creating and deleting tables, and inserting, updating, and deleting rows in a table are all covered in a whirlwind tour. This seems like the chapter’s biggest strength and its biggest weakness. There is a lot of information covered very quickly, and it is a LOT of information covered VERY quickly.
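As a small taste of the join and aggregation topics listed above, the following sketch uses Python's built-in sqlite3 module with an in-memory database. The students and grades tables, their columns, and the sample rows are all invented for illustration and do not come from the book.

```python
import sqlite3


def average_grades():
    """Build two tiny tables and return each student's average mark."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Hypothetical schema: one row per student, many grade rows per student.
    cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("CREATE TABLE grades (student_id INTEGER, mark REAL)")
    cur.executemany("INSERT INTO students VALUES (?, ?)",
                    [(1, "Ada"), (2, "Grace")])
    cur.executemany("INSERT INTO grades VALUES (?, ?)",
                    [(1, 80.0), (1, 90.0), (2, 70.0)])

    # Join the tables, then aggregate the marks for each student.
    cur.execute(
        "SELECT s.name, AVG(g.mark) "
        "FROM students AS s JOIN grades AS g ON g.student_id = s.id "
        "GROUP BY s.name ORDER BY s.name"
    )
    rows = cur.fetchall()
    conn.close()
    return rows
```

The function returns one (name, average) pair per student, ordered by name.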

"Horseshoe Nails" is the title of Chapter 7, and it covers a series of disconnected topics, including string input and output, encoding and decoding, floating point arithmetic, and sampling and auditing. This is meant to be the 'catch-all' chapter for all those smaller topics that do not fit in the other chapters, but are still important to a well-rounded discussion of data crunching tools and techniques.
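The floating point topic is the sort of small detail that can quietly spoil a data crunching job. A minimal Python sketch of the usual pitfall and one common remedy follows; the sums_match helper and its tolerance are assumptions made for illustration, not the book's recommendation.

```python
import math


def sums_match(values, expected, rel_tol=1e-9):
    """Compare a floating point sum to an expected total within a tolerance."""
    return math.isclose(sum(values), expected, rel_tol=rel_tol)


# Binary floating point cannot represent 0.1, 0.2, or 0.3 exactly,
# so the exact comparison fails while the tolerant one succeeds.
exact_equal = (0.1 + 0.2 == 0.3)
tolerant_equal = sums_match([0.1, 0.2], 0.3)
```

Here exact_equal is False and tolerant_equal is True.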

There is a table of contents and an index, and each chapter includes a "lessons learned" section with some helpful summary information. The book has some very useful techniques and succinct explanations of difficult-to-master concepts such as regular expressions. Entire books have been written about regular expressions, but I think it is possible to get up and running using them, at least at a basic level, based on Chapter 3 alone. The chapters on SQL and XML also encapsulate useful information in a succinct manner.

It is difficult to say who would be the ideal reader for this book. The suggestion of intermediate expertise provided on the back of the book does not seem quite right as a description of the person who would enjoy this book, but rather an average of the skill levels that would appreciate it. Some portions of the book seem to be too elementary, while others are quite complex. As someone with expertise somewhere between beginner and intermediate, I found myself in that position, reading quickly through parts of the material that I was already quite familiar with and a little at sea in areas that were well beyond my current expertise.

As is the case with many texts on computer programming, a reader will probably not read straight through the book, but pick and choose from the parts that are most interesting or most immediately useful. The best use for this book would probably be as a reference on some basic techniques of data crunching. For all its flaws, I can still recommend it as a book that provides a lot of information in a very succinct manner, with many examples to guide the reader. I can imagine a patron thumbing to those parts of the book that apply to the current task they are working on, and getting a little guidance in how to proceed, which may be exactly what the author had in mind.

Scott Rice is Networked Information Services Librarian at the University of North Carolina Greensboro.

Copyright © 2007 by Scott Rice. This document may be reproduced in whole or in part for noncommercial, educational, or scientific purposes, provided that the preceding copyright statement and source are clearly acknowledged. All other rights are reserved. For permission to reproduce or adapt this document or any part of it for commercial distribution, address requests to the author.


Technology Electronic Reviews (TER) is an irregular electronic serial publication of the Library and Information Technology Association, a division of the American Library Association, 50 E. Huron St., Chicago, IL 60611. The primary function of TER is to provide reviews of and pointers to a variety of print and electronic resources about information technology. Resources include books, articles, serials, discussion lists, training materials, bibliographies, and other items of interest to librarians and information technology professionals. The topics covered may include, but are not limited to, networking technologies and standards; hardware and software; operating systems; databases; specific programming languages; management tools and utilities; technical project management; training and personnel issues; library perspectives; and research and development.

Opinions expressed in this publication are those of the writers and do not necessarily represent the viewpoints of LITA, ALA, or organizations involved in the storage and/or distribution of the publication.

TER is distributed electronically via the Internet. There is no subscription fee.


LITA provides its members, other ALA divisions and members, and the library and information science field as a whole with a forum for discussion, an environment for learning, and a program for action on the design, development, and implementation of automated and technological systems in the library and information science field.

