Beautiful Data and Human Behavior

New tool, new vision

Without the large, precise sighting instruments he built (he worked before the telescope was invented), Tycho Brahe would have had no chance to accumulate his huge data set of stellar and planetary positions, which laid the foundation for modern astronomy.

Similarly, the widespread use of the Internet gives scholars an opportunity to study human behavior at unprecedented scale and resolution, and to see beyond the traditional theories of the social sciences.

How much data is "big data"?

Social scientists are most familiar with data sets collected in surveys and experiments, which are usually megabytes (MB) in size. After decades of practice with these small data sets, social scientists tend to call any data set that is not small "big data".

In contrast, physicists and computer scientists have been dealing with large data sets for decades. For example, the Hubble Space Telescope collects 17 GB of data per day, the Large Hadron Collider generates 42 TB of data per day, and PB-level user behavior data sets are added to Google's data centers every day.

Given that parallel computing systems are increasingly used for data storage and processing, we may define "big data" as data sets that do not fit on a single machine (in either memory or disk). The table here summarizes the software and hardware requirements for data analysis across scales. GOS stands for graphical user interface operating systems. TOS stands for terminal operating systems.

Scale	Size	Software and hardware
Small	MB	Excel/SPSS + GOS + single machine
Medium	GB	R/Matlab/Mathematica + GOS + single machine
Large	TB	Python/PHP/... + TOS + single machine
Big	PB	Java/C++/... + TOS + multiple machines
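
The definition above can be written as a small check. This is only a sketch: the tier boundaries simply follow the MB/GB/TB/PB labels in the table, and the function names and default machine capacities (64 GB of memory, 8 TB of disk) are hypothetical values chosen for illustration.

```python
# Illustrative thresholds only: tier boundaries follow the table's labels,
# and the default machine capacities are hypothetical.
KB, MB, GB, TB, PB = (1024 ** i for i in range(1, 6))


def scale_of(size_bytes):
    """Map a data set size to the table's Small/Medium/Large/Big labels."""
    if size_bytes < GB:
        return "Small"
    if size_bytes < TB:
        return "Medium"
    if size_bytes < PB:
        return "Large"
    return "Big"


def fits_on_single_machine(size_bytes, memory_bytes=64 * GB, disk_bytes=8 * TB):
    """Return 'memory', 'disk', or None if the data exceeds both capacities."""
    if size_bytes <= memory_bytes:
        return "memory"
    if size_bytes <= disk_bytes:
        return "disk"
    return None


# The daily volumes quoted in the text, checked against the assumed machine.
for label, size in [("Hubble, one day", 17 * GB), ("LHC, one day", 42 * TB)]:
    print(label, scale_of(size), fits_on_single_machine(size))
```

Under this definition, whether a data set counts as "big" depends on the machine at hand as much as on the data, so the thresholds shift as hardware improves.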

Accuracy vs. explainability

Social scientists increasingly work with computer scientists, physicists, and scholars from other fields to look beyond traditional theories. But every perspective comes with its own limitations, and it is important to be aware of them in "data science" practice.

Interestingly, accuracy and explainability seem to trade off against each other: the most powerful predictive models are usually the least interpretable. Many people think accuracy trumps explainability in the age of "big data", but I would say that understanding is always a scientist's first priority.

Trends to watch

The following are several trends worth keeping an eye on (from my very personal perspective). The tools and ideas coming out of them have dramatically shaped the development of data science and will continue to show their impact.

Complex Networks

A vast majority of social, biological, and technological networks have been found to exhibit non-trivial and universal features, including power-law-like degree distributions, high clustering coefficients, assortativity or disassortativity among nodes, and hierarchical or nested structure. Network analysis closes the gap between domains and provides a unified framework for studying how complex collective dynamics emerge from the simple behavior of individual agents.
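
Two of the features named above, the degree distribution and the clustering coefficient, are easy to compute by hand. The sketch below uses plain Python on a made-up four-node network. The function names are illustrative and not taken from any library; in practice a network library would normally be used instead.

```python
from collections import Counter
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from edge pairs."""
    graph = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def degree_distribution(graph):
    """Return {degree: fraction of nodes with that degree}."""
    counts = Counter(len(neighbors) for neighbors in graph.values())
    n = len(graph)
    return {k: c / n for k, c in sorted(counts.items())}


def local_clustering(graph, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
    return links / (k * (k - 1) / 2)


def average_clustering(graph):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(graph, v) for v in graph) / len(graph)


# Toy network: a triangle (a, b, c) with one pendant node d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
g = build_graph(edges)
print(degree_distribution(g))  # {1: 0.25, 2: 0.5, 3: 0.25}
print(average_clustering(g))   # (1 + 1 + 1/3 + 0) / 4, about 0.583
```

On a real network, the degree distribution would be plotted on log-log axes to check for power-law-like behavior. A toy graph this small only shows the mechanics.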

Deep Learning

It is everywhere now. Deep learning models are artificial neural networks with multiple layers (which is why they are called "deep"). They learn patterns from data (e.g., images) by adjusting the weights on the edges between units. By storing knowledge in this linking structure, artificial neural networks can make seemingly intelligent decisions (e.g., identifying all images containing cat faces).
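
To make "adjusting the weights on edges" concrete, here is a minimal sketch of a network with a single hidden layer, trained by plain gradient descent on squared error. Everything in it is an assumption made for illustration: the class name `TinyNetwork`, the sigmoid units, the learning rate, and the XOR toy data. Real deep learning models have many more layers and are built with dedicated libraries.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNetwork:
    """Inputs -> one hidden layer -> single output, all sigmoid units."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # w_hidden[j][i] is the weight on the edge from input i to hidden
        # unit j; the last entry of each row is that unit's bias.
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        """Return (hidden activations, output) for an input list x."""
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
                  for row in self.w_hidden]
        output = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, output

    def train_step(self, x, target, lr=0.5):
        """One gradient-descent update on the squared error of one example."""
        hidden, output = self.forward(x)
        # Error signals, propagated backwards from the output unit.
        delta_out = (output - target) * output * (1 - output)
        delta_hidden = [delta_out * self.w_out[j] * h * (1 - h)
                        for j, h in enumerate(hidden)]
        # Adjust each edge weight against its gradient.
        for j, v in enumerate(hidden + [1.0]):
            self.w_out[j] -= lr * delta_out * v
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x + [1.0]):
                row[i] -= lr * delta_hidden[j] * v


def mean_squared_error(net, data):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


# Toy data (XOR), chosen only as a small non-linear pattern.
xor = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
net = TinyNetwork(n_inputs=2, n_hidden=3)
print("error before training:", mean_squared_error(net, xor))
for _ in range(2000):
    for x, t in xor:
        net.train_step(x, t)
print("error after training:", mean_squared_error(net, xor))
```

After training, whatever the network has learned lives entirely in `w_hidden` and `w_out`, which is the sense in which knowledge is stored in the linking structure.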

Sentiment Analysis

Sentiment analysis aims to identify the attitude of a document automatically. It uses text mining techniques and machine learning models to extract subjective information from source materials.
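
As a rough illustration of extracting an attitude from text, the sketch below scores a document against a small word list. Note that this is a lexicon-based approach rather than a trained machine learning model, and that the lexicon, the negation rule, and the function names are all hypothetical, made up for this example.

```python
import re

# Hypothetical toy lexicon: word -> polarity score. Real systems use much
# larger curated lexicons or models trained on labeled text.
LEXICON = {"good": 1, "great": 2, "love": 2, "happy": 1,
           "bad": -1, "terrible": -2, "hate": -2, "sad": -1}
NEGATIONS = {"not", "no", "never"}


def sentiment_score(text):
    """Sum word polarities, flipping a word that directly follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score


def classify(text):
    """Label a document as positive, negative, or neutral by its score."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("I love this great phone"))        # positive
print(classify("This is not good"))               # negative
print(classify("The package arrived on Monday"))  # neutral
```

A word list like this misses sarcasm, context, and words outside the lexicon, which is one reason trained models are used in practice.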

Human Computation

Human computation is a computer science technique in which a machine outsources certain computational tasks to humans in order to perform its function. The outsourced tasks are usually those that are easy for humans but difficult for the available machine learning models.

Intelligence Amplification

In contrast to AI (Artificial Intelligence), which aims to build human-like intelligence using machines, IA (Intelligence Amplification) seeks to enhance the capability of humans to process information and solve problems. A critical step toward IA is data visualization. Two other relevant yet slightly different trends are augmented reality (AR) and virtual reality (VR).
