Here are some impressive statistics: our society generates as much data today as it did in all the years leading up to 2003, and over 90% of the world’s data was produced in the last two years. The advent of social media, online banking and trading, and rapid technology development in biomedical sciences, finance, political science, and many other fields lends credibility to these claims. Many people own a smartphone, laptop, or iPad (often more than one), all of which generate massive amounts of data every day. In the news, we hear about data collection and privacy invasion by the NSA, big phone companies such as AT&T, and tech companies such as Google. But we also hear about how our data helps predict elections and consumer behavior, diagnose disease, and support many other applications that improve our society in some way. With all this talk about data, and the good and bad surrounding it, most of us still find "data" a vague, intangible buzzword. So what is data anyway?
Simply put, it’s information, facts and numbers, whether it’s your tweets, stock trades, iTunes music library, or a table of numbers. Anything that describes something else, usually in a condensed form such as a database or matrix, counts as data. When one has a lot of data, like what’s generated by the Large Hadron Collider or next-generation sequencing, one has so-called “Big Data.” On a computer, data is stored as 0s and 1s called bits, and 8 bits grouped together make up a byte. You may be familiar with kilobytes (KB), megabytes (MB), and gigabytes (GB), which are a thousand, a million, and a billion bytes, respectively. That’s the size of the data usually seen on personal laptops, such as Word documents, music, and videos. However, get into some very computationally intensive work, like colliding particles or sifting through millions of tweets, and we enter the realm of Big Data, with terabytes (a trillion bytes), petabytes (a thousand terabytes), and exabytes (a thousand petabytes) of data. What people call Big Data varies in size from field to field; data from the Large Hadron Collider is much larger than that from biomedical sciences. Nevertheless, it’s A LOT of data.
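To make the unit ladder above concrete, here is a minimal Python sketch that converts a raw byte count into the largest sensible unit. It assumes the decimal convention described above, where each step up is a factor of 1,000; the function names `bits_to_bytes` and `human_readable` are made up for illustration and do not come from any particular library.

```python
# Decimal (SI-style) units: each step up is a factor of 1,000 bytes.
# These names are illustrative, not part of any standard library.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]


def bits_to_bytes(bits: int) -> float:
    """Convert a number of bits to bytes (8 bits per byte)."""
    return bits / 8


def human_readable(num_bytes: float) -> str:
    """Express a byte count in the largest unit that keeps the value under 1,000."""
    size = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at the largest unit we know.
        if size < 1000 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


if __name__ == "__main__":
    print(human_readable(bits_to_bytes(16)))   # two bytes
    print(human_readable(1_500_000))           # a small photo or document
    print(human_readable(5 * 10**18))          # Big Data territory
```

Note that some operating systems instead count in powers of 1,024, so the sizes they report for the same file can differ slightly from this sketch.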
Data science, the discipline of managing, handling, analyzing, and visualizing data, is difficult to define precisely, mostly because each field generates vastly different kinds of data. However, one of the most famous statisticians of the 20th century, John W. Tukey, set the stage. In his 1962 article, “The Future of Data Analysis,” he wrote about how the statistician and his field were evolving, emphasizing the analysis and management of empirical data (that is, information that is observed) and the new feasibility of such analyses provided by the computer. Tukey made a pretty accurate prediction of what data science looks like today. Generally, one uses computer software and statistical methods to generate or address questions based on some observed data.
Data scientist, meaning someone with data science expertise in some field, has been dubbed the sexiest job of the 21st century. Why is it so popular? One reason might be that there’s SO much data generated from every aspect of our society, and we can use that data for good: to improve diagnostics, deepen our understanding of the universe, make money, and so much more. In fact, data science has become so vital to our society that President Obama created a new job: Chief Data Scientist of the United States of America. The main motivation behind the post is to use government data to help government serve us better. A great example of this comes directly from the Chief Data Scientist, DJ Patil. As a doctoral student and faculty member at the University of Maryland, College Park, DJ used data from the National Oceanic and Atmospheric Administration (NOAA) to improve weather forecasts. With the advent of data.gov, there’s a great opportunity to use publicly available datasets from multiple government agencies to make government and private-sector decisions work for the American people.
In STEM, the applications of Big Data pervade every field, from particle physics to genomics and numerical analysis, just to name a few. There are also plenty of data science applications in the social sciences and the humanities. Big Data is used as an avenue for analyzing and predicting the political process, banks use it to assess financial and market trends, and, in a positive feedback loop, it’s being used to display the history of Big Data itself. Other examples outside of STEM include making financial predictions to earn millions of dollars, analyzing transportation hotspots to improve infrastructure in developing countries, and contributing to a political conversation through tweets, as we first saw during the Egyptian protests of 2011. There are many examples of the use of Big Data for the good of society.
However, all too often we hear about privacy invasion, financial scandals, and international threats that arise from the misuse of data. We also know that once something is put up on the Internet, it’s there to stay, and this can really damage people’s reputations and livelihoods. Cases of credit card fraud, stolen identities, and consumer manipulation provide painful examples of how data is a double-edged sword.
The old Spider-Man adage applies very well to the generation and analysis of data: “With great power comes great responsibility.”
Cover Photo Source: http://bit.ly/1NLZiS7.