Beautiful Data and Human Behavior
New tool, new vision
Without his large, precision-built observing instruments, Tycho Brahe could never have accumulated his huge data set of stellar and planetary positions, which laid the foundation for modern astronomy.
Similarly, the widespread use of the Internet gives scholars an opportunity to study human behavior at unprecedented scale and resolution, and to see beyond the traditional theories of the social sciences.
How much data is "big data"?
Social scientists are most familiar with data sets collected in surveys and experiments, which are usually at the MB level. After decades of practice with these small data sets, social scientists tend to call all non-small data sets "big data".
In contrast, physicists and computer scientists have been dealing with large data sets for decades. For example, the Hubble Space Telescope collects 17 GB of data per day, the Large Hadron Collider generates 42 TB of data per day, and petabytes of user-behavior data are added to Google's data centers every day.
Considering that parallel computing systems are increasingly used for data storage and processing, we may define "big data" as any data set that does not fit on a single machine (in either memory or disk). The table below summarizes software and hardware requirements for data analysis across scales, followed by a short Python sketch of the single-machine chunking strategy. GOS stands for graphical-user-interface operating systems; TOS stands for terminal operating systems.
Size | Scale | Software and Hardware
---|---|---
Small | MB | Excel/SPSS + GOS + single machine
Medium | GB | R/Matlab/Mathematica + GOS + single machine
Large | TB | Python/PHP/... + TOS + single machine
Big | PB | Java/C++/... + TOS + multiple machines
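As a concrete illustration of the "Large" row, here is a minimal Python sketch of chunked processing on a single machine. It assumes the pandas library and a hypothetical `events.csv` file with a `user_id` column; only one chunk is ever held in memory at a time.

```python
# A minimal sketch: process a file too large for memory in fixed-size
# chunks, the single-machine strategy implied by the table above.
import pandas as pd

total_rows = 0
user_counts = {}

# chunksize keeps memory bounded: only one chunk is resident at a time
for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
    total_rows += len(chunk)
    # aggregate per-user event counts chunk by chunk
    for user, n in chunk["user_id"].value_counts().items():
        user_counts[user] = user_counts.get(user, 0) + n

print(f"{total_rows} rows processed, {len(user_counts)} distinct users")
```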
Accuracy vs. explainability
It is increasingly common for social scientists to work with computer scientists, physicists, and scholars from other areas in order to see beyond traditional theories. But every perspective comes with its own limitations, and it is important to be aware of these limitations in "data science" practice.
It is interesting to note that accuracy and explainability seem to involve a trade-off: the most powerful predictive models are usually the least interpretable. Many people think accuracy trumps explainability in the age of "big data", but I would say understanding is always the first priority of scientists.
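To make the trade-off concrete, here is a minimal sketch using scikit-learn (the data set and model settings are just illustrative choices): a shallow decision tree can be printed as human-readable rules, while a 300-tree random forest is usually more accurate but offers no comparable summary.

```python
# A small decision tree (interpretable) vs. a random forest (accurate):
# the ensemble usually wins on test accuracy but cannot be read as rules.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

print("tree accuracy:  ", tree.score(X_test, y_test))
print("forest accuracy:", forest.score(X_test, y_test))
# The tree can be printed as human-readable rules; the forest cannot.
print(export_text(tree, max_depth=2))
```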
Trends to watch
The following are several trends worth keeping an eye on (from my very personal perspective). The tools and ideas coming out of them have dramatically shaped the development of data science and will continue to show their impact.
Complex Networks
Researchers have discovered that a vast majority of social, biological, and technological networks exhibit non-trivial and universal features, including power-law-like degree distributions, high clustering coefficients, assortativity or disassortativity among nodes, and hierarchical/nested structure. Network analysis closes the gap between domains and provides a unified framework for studying the emergence of complex collective dynamics from the simple behavior of individual agents.
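As a small illustration, the following sketch (assuming the networkx library) grows a scale-free network with the Barabasi-Albert model and measures some of the features mentioned above:

```python
# Generate a scale-free graph and measure its degree heterogeneity,
# clustering, and degree assortativity.
import networkx as nx

G = nx.barabasi_albert_graph(n=10_000, m=3, seed=42)

degrees = [d for _, d in G.degree()]
print("max degree:", max(degrees))  # a few highly connected hubs
print("average clustering:", nx.average_clustering(G))
print("assortativity:", nx.degree_assortativity_coefficient(G))
```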
Deep Learning
It is everywhere now. Deep learning models are multi-layer (hence "deep") artificial neural networks. They learn patterns from data (e.g., images) by adjusting the weights on their edges. By storing knowledge in the linking structure, artificial neural networks are capable of making intelligent-seeming decisions (e.g., identifying all images containing cat faces).
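For the flavor of how this weight adjustment works, here is a minimal sketch in plain NumPy: a two-layer network trained by gradient descent to learn XOR, the classic pattern a single layer cannot capture. The layer sizes and learning rate are arbitrary illustrative choices.

```python
# A tiny two-layer neural network learning XOR by gradient descent,
# i.e., adjusting the weights on its edges from data.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros((1, 8))   # input -> hidden
W2, b2 = rng.normal(size=(8, 1)), np.zeros((1, 1))   # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    h = sigmoid(X @ W1 + b1)       # hidden-layer activations
    out = sigmoid(h @ W2 + b2)     # network prediction
    # backpropagate the error and nudge every weight
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0, keepdims=True)
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(axis=0, keepdims=True)

print(out.round(2))  # should approach [[0], [1], [1], [0]]
```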
Sentiment Analysis
Sentiment analysis aims to identify the attitude of a document automatically. It uses text mining techniques and machine learning models to extract subjective information from source materials.
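As a toy illustration of the idea, the sketch below scores text against a tiny hand-made sentiment lexicon; real systems use large lexicons or trained machine learning models.

```python
# Lexicon-based sentiment scoring with a toy hand-made word list.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def sentiment(text: str) -> float:
    """Return a score in [-1, 1]: positive minus negative word share."""
    words = text.lower().split()
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

print(sentiment("I love this great product"))   # > 0: positive attitude
print(sentiment("terrible service, I hate it")) # < 0: negative attitude
```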
Human Computation
Human computation is a computer science technique in which machines outsource certain computational tasks to humans in order to perform their functions. The outsourced tasks are usually those that are easy for humans but difficult for the available machine learning models.
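One common pattern is to route only a model's low-confidence cases to human workers. The sketch below is a hypothetical illustration; the `StubModel` class, its `(label, confidence)` API, and the threshold are assumptions, not a real library.

```python
# The machine keeps the predictions it is confident about and queues
# the hard cases for human workers (a hypothetical sketch).
CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff, tuned per application

class StubModel:
    """Hypothetical model returning (label, confidence) pairs."""
    def predict(self, item):
        return ("cat", 0.95) if "whiskers" in item else ("cat", 0.55)

def classify_or_outsource(item, model, human_queue):
    """Label an item automatically, or hand it to humans if uncertain."""
    label, confidence = model.predict(item)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label                 # easy for the machine
    human_queue.append(item)         # easy for a human, hard for the model
    return None                      # the label will come from a worker

queue = []
print(classify_or_outsource("photo with whiskers", StubModel(), queue))  # cat
print(classify_or_outsource("blurry photo", StubModel(), queue))         # None
print(queue)                         # ['blurry photo'] awaits human labels
```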
Intelligence Amplification
In contrast with AI (Artificial Intelligence), which aims to build human-like intelligence in machines, IA (Intelligence Amplification) seeks to enhance the capability of humans to process information and solve problems. A critical step toward IA is data visualization. Two other relevant yet slightly different trends are augmented reality (AR) and virtual reality (VR).
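As a small illustration of visualization as amplification, the sketch below (assuming NumPy and matplotlib, with made-up data) plots 200 noisy points whose sinusoidal structure would be hard to see in a raw table of numbers:

```python
# The raw numbers in (x, y) reveal little; one scatter plot makes the
# hidden pattern obvious to a human eye.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.2, size=200)  # hidden structure + noise

plt.scatter(x, y, s=10)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Structure that is hard to see in a table of 200 rows")
plt.show()
```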