Data science and sociology: how to use data to explore and model social phenomena

Carlo C.
Sep 19, 2023


Sociology by Author with ideogram.ai

1. Sociology as a social science that studies social phenomena

Sociology is the social science that studies social phenomena, i.e. the interactions, relationships, structures and processes that characterize human social life. Its main objective is to understand and explain social reality, both at the micro level (individuals and groups) and at the macro level (societies and cultures).

The main theories of sociology seek to provide interpretative models and analytical perspectives on social phenomena, based on concepts, hypotheses and general principles. Some of the best-known are functionalism, conflict theory, interactionism, structuralism and postmodernism.

The main methods of sociology make it possible to collect, process and analyze empirical data on social phenomena, using qualitative or quantitative techniques. Some of the most widely used are participant observation, interviews, questionnaires, content analysis and statistical analysis.

Sociology is now facing new challenges and opportunities in the digital age, due to the increasing availability and complexity of social data, the spread of digital technologies and the transformation of society into an information society. Sociology must therefore adapt to these changes and integrate its knowledge and skills with those of data science.

Sociology has important applications and benefits in different fields, both academic and professional. Sociology can help generate new knowledge and solve social problems in areas such as education, health, politics, economics, communication, culture and the environment.

2. Sociology as a data source for data science

Data science is the discipline that deals with extracting value from data, using scientific methods, algorithms and computational tools. Its main objective is to discover and communicate knowledge and solutions based on data, at a descriptive, predictive or prescriptive level.

Social data is data about human beings and their interactions, relationships, behaviors, opinions, feelings, values and cultures. It is a valuable and indispensable source for data science, as it allows us to analyze and understand social reality in a quantitative way.

The sources of social data are multiple and varied, and can be classified into two broad categories: traditional sources and digital sources. Traditional sources are those that produce social data through classical methods of collection, such as censuses, surveys, interviews and observations. Digital sources are those that produce social data through digital technologies, such as social media, mobile devices, sensors and online platforms.

Methods of social data collection are the processes that allow social data to be obtained from available sources, in a systematic and rigorous way. Social data collection methods can be divided into two types: active methods and passive methods. Active methods are those that require the active participation of individuals or social groups, such as questionnaires, interviews and focus groups. Passive methods are those that do not require the active participation of individuals or social groups, but are based on the analysis of data generated spontaneously or involuntarily, such as data from social media, mobile devices or sensors.

Social data analysis techniques are the procedures that transform social data into useful and meaningful information, using statistical, mathematical or computational methods. They can be divided into two categories: descriptive techniques and inferential techniques. Descriptive techniques summarize and visualize social data, using measures of central tendency, dispersion, correlation or association. Inferential techniques draw conclusions and generalizations from social data, using hypothesis tests, confidence intervals or predictive models.
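To make the distinction concrete, here is a minimal sketch using only the Python standard library. The survey values and variable names are invented for illustration, and the confidence interval uses the normal approximation (critical value 1.96), which is an assumption that would be too optimistic for a sample this small.

```python
import math
import statistics

# Hypothetical survey data (invented values): weekly hours online and
# self-reported number of close friends for eight respondents.
hours_online = [5, 12, 8, 20, 15, 3, 10, 18]
close_friends = [9, 6, 7, 3, 4, 10, 6, 4]

# Descriptive techniques: central tendency and dispersion.
mean_hours = statistics.mean(hours_online)
median_hours = statistics.median(hours_online)
stdev_hours = statistics.stdev(hours_online)  # sample standard deviation


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Descriptive technique: association between the two variables.
r = pearson(hours_online, close_friends)

# Inferential technique: an approximate 95% confidence interval for the
# population mean, based on the normal critical value 1.96.
std_error = stdev_hours / math.sqrt(len(hours_online))
ci_low = mean_hours - 1.96 * std_error
ci_high = mean_hours + 1.96 * std_error

print(f"mean={mean_hours:.2f} median={median_hours} stdev={stdev_hours:.2f}")
print(f"correlation={r:.2f}")
print(f"approx. 95% CI for the mean: [{ci_low:.2f}, {ci_high:.2f}]")
```

The first block only describes the eight respondents we observed; the interval at the end is a statement about the wider population they are assumed to be sampled from.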

The issues related to social data are the difficulties and limitations encountered in the management and analysis of social data, due to their complex and dynamic nature. Some of the most common issues are those related to the quality, quantity, representativeness, privacy and ethics of social data.

Social data solutions are the strategies and actions that can be taken to address and solve social data issues, using data science skills and tools. Some of the most effective solutions are those related to the cleaning, standardization, integration, protection and regulation of social data.
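As a rough illustration of three of these steps (cleaning, standardization and protection), the sketch below works on a handful of invented records. The record layout, the `SALT` value and the helper names are all hypothetical, and salted hashing is only pseudonymization, not full anonymization.

```python
import hashlib
import statistics

# Hypothetical raw records (invented values): (user id, age, monthly income).
raw_records = [
    ("anna@example.org", 34, 2100.0),
    ("anna@example.org", 34, 2100.0),    # exact duplicate
    ("luca@example.org", None, 1800.0),  # missing age
    ("sara@example.org", 51, 3200.0),
    ("marc@example.org", 27, 1500.0),
]

# Cleaning: drop exact duplicates and records with missing values.
seen = set()
clean = []
for record in raw_records:
    if record in seen or any(value is None for value in record):
        continue
    seen.add(record)
    clean.append(record)


def z_scores(values):
    """Standardization: rescale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


ages_z = z_scores([age for _, age, _ in clean])
incomes_z = z_scores([income for _, _, income in clean])

# Protection: replace direct identifiers with a salted hash.
# The salt below is a placeholder and would have to be kept secret.
SALT = b"replace-with-a-secret-salt"


def pseudonymize(identifier):
    digest = hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()
    return digest[:12]


protected = [
    (pseudonymize(uid), age_z, income_z)
    for (uid, _, _), age_z, income_z in zip(clean, ages_z, incomes_z)
]

for row in protected:
    print(row)
```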

3. Machine learning as a tool to simulate social phenomena

Machine learning is the branch of artificial intelligence that deals with creating systems that can learn from data, without being explicitly programmed. Machine learning has as its main goal to create models that can mimic or exceed human abilities to solve complex problems.

Simulations are simplified and controlled representations of reality, which allow you to explore and experiment with alternative scenarios, in order to test hypotheses, predict effects or optimize solutions. Simulations are powerful and versatile tools for the study of social phenomena, as they make it possible to analyze the dynamics and interactions between social agents, both at the micro and at the macro level.

Simulations of social phenomena draw on social data, which is used to feed and validate the simulation models. Social data sources can be both traditional and digital, as we saw in the previous section. They must be chosen on the basis of the quality, quantity, representativeness and relevance of the data for the social phenomenon to be simulated.

Methods for creating simulations of social phenomena are the processes that allow you to build and calibrate simulation models, using machine learning techniques and algorithms. They can be divided into two types: equation-based methods and agent-based methods. Equation-based methods use mathematical formulas to describe the behavior of social agents and the relationships between social variables. Agent-based methods use autonomous and interacting entities to represent social agents and their rules of behavior.
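The two families can be contrasted on a toy example: the spread of a behavior through a population. This is only a sketch under simple assumptions. The function names, the adoption probability `beta` and the "meet one random peer per step" rule are invented for illustration and are not calibrated on real data.

```python
import random


def equation_based(beta, x0, steps):
    """Equation-based: one formula updates the share of adopters.

    x[t+1] = x[t] + beta * x[t] * (1 - x[t])  (discrete logistic growth)
    """
    shares = [x0]
    for _ in range(steps):
        x = shares[-1]
        shares.append(x + beta * x * (1 - x))
    return shares


def agent_based(beta, n_agents, n_initial, steps, seed):
    """Agent-based: each agent follows a local rule of behavior.

    At every step a non-adopter meets one randomly chosen agent and,
    if that agent has already adopted, adopts with probability beta.
    """
    rng = random.Random(seed)
    adopted = [i < n_initial for i in range(n_agents)]
    shares = [sum(adopted) / n_agents]
    for _ in range(steps):
        new_state = adopted[:]  # synchronous update
        for i in range(n_agents):
            if not adopted[i]:
                peer = rng.randrange(n_agents)
                if adopted[peer] and rng.random() < beta:
                    new_state[i] = True
        adopted = new_state
        shares.append(sum(adopted) / n_agents)
    return shares


eq_curve = equation_based(beta=0.3, x0=0.05, steps=30)
ab_curve = agent_based(beta=0.3, n_agents=200, n_initial=10, steps=30, seed=1)

for t in range(0, 31, 10):
    print(f"t={t:2d} equation={eq_curve[t]:.2f} agents={ab_curve[t]:.2f}")
```

In this toy setup the equation describes what the agent rule does on average, while the agent-based version also shows the run-to-run variation that the formula hides.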

Techniques for running simulations of social phenomena are the procedures that allow simulation models to be executed and monitored, using computing resources and tools. They can be divided into two categories: deterministic techniques and stochastic techniques. Deterministic techniques always produce the same results from the same inputs. Stochastic techniques introduce random or probabilistic elements into simulation models, so results can differ from run to run.
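The difference can be seen in a few lines. In this sketch, a single opinion value drifts toward a target, first with a fixed rule and then with added Gaussian noise. The model, its parameters (`pull`, `target`, `noise`) and the function names are assumptions made up for the example.

```python
import random
import statistics


def deterministic_step(opinion, pull=0.1, target=0.5):
    """Move the opinion a fixed fraction of the way toward the target."""
    return opinion + pull * (target - opinion)


def stochastic_step(opinion, rng, pull=0.1, target=0.5, noise=0.05):
    """Same pull plus Gaussian noise, clipped to the interval [0, 1]."""
    value = opinion + pull * (target - opinion) + rng.gauss(0, noise)
    return min(1.0, max(0.0, value))


def run(start, steps, rng=None):
    """Run the deterministic model, or the stochastic one if rng is given."""
    opinion = start
    for _ in range(steps):
        if rng is None:
            opinion = deterministic_step(opinion)
        else:
            opinion = stochastic_step(opinion, rng)
    return opinion


# Deterministic: same inputs, same output, every time.
det_a = run(0.9, 20)
det_b = run(0.9, 20)

# Stochastic: each seed gives a different trajectory, so we run the
# model many times and summarize the distribution of outcomes.
sto_runs = [run(0.9, 20, random.Random(seed)) for seed in range(200)]

# Fixing the seed makes a stochastic run reproducible.
same_seed_a = run(0.9, 20, random.Random(42))
same_seed_b = run(0.9, 20, random.Random(42))

print(f"deterministic: {det_a:.4f} (repeat: {det_b:.4f})")
print(f"stochastic mean over 200 runs: {statistics.mean(sto_runs):.4f}")
print(f"stochastic spread (stdev): {statistics.stdev(sto_runs):.4f}")
print(f"same seed twice: {same_seed_a == same_seed_b}")
```

Recording the seed alongside the results is a simple way to keep stochastic simulations reproducible.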

The issues related to simulations of social phenomena are the difficulties and limitations encountered in building and using them, due to their complexity and uncertainty. Some of the most common issues concern the validity, robustness, scalability, reproducibility and interpretation of simulations of social phenomena.

4. Practical examples of the use of ML applied to sociology

In this section, we will see some concrete examples of how machine learning can be applied to sociology, to solve real problems and create social value. For each example, we will describe the problem, method, result, and benefit of using machine learning.

4.1 Using clustering to segment and profile social groups

The problem we want to address is to identify and characterize the different social groups that make up a population, based on demographic, socioeconomic, cultural or behavioral variables. This allows us to better understand the structure and composition of society, and to adapt policies and strategies according to the needs and preferences of different groups.

The method we use is clustering, an unsupervised machine learning technique that groups elements according to their similarity, without a priori knowledge of the categories. Many clustering algorithms compute a distance between elements and place similar elements in the same cluster; k-means, for example, assigns each element to the cluster with the nearest centroid. Some of the most used algorithms are k-means, hierarchical clustering and DBSCAN.

The result we obtain is a segmentation of social groups, that is, a subdivision of the population into homogeneous and distinct subgroups. With a centroid-based algorithm such as k-means, each cluster can be summarized by a centroid, which gives its mean characteristics, and by a standard deviation, which measures its internal variability. We can visualize clusters using dimensionality reduction techniques, such as PCA or t-SNE.
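As a minimal sketch of this method and result, here is a plain-Python k-means on a few invented respondents. The data, the two variables and the function names are hypothetical. In a real project one would more likely rely on an existing library implementation and scale the variables first.

```python
import math
import random
import statistics

# Hypothetical respondents (invented values), already on comparable
# scales: (age in decades, weekly hours online divided by 10).
people = [
    (2.1, 3.0), (2.4, 2.8), (1.9, 3.3), (2.2, 2.9),  # younger, heavy users
    (5.8, 0.6), (6.1, 0.4), (6.5, 0.9), (5.9, 0.5),  # older, light users
]


def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(statistics.mean(coord) for coord in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Basic k-means: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random data points
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centroids.append(mean_point(members) if members else centroids[c])
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return labels, centroids


labels, centroids = k_means(people, k=2)

# Profile each cluster: centroid plus per-variable standard deviation.
profiles = {}
for c, centroid in enumerate(centroids):
    members = [p for p, lab in zip(people, labels) if lab == c]
    spread = tuple(statistics.pstdev(coord) for coord in zip(*members))
    profiles[c] = {"size": len(members), "centroid": centroid, "stdev": spread}
    print(c, profiles[c])
```

The `profiles` dictionary is the kind of summary the next paragraph refers to: one row per group, described by its typical values and its internal variability.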

The benefit we derive is a profiling of social groups, i.e. a detailed description of the different groups in terms of relevant variables. We can use these profiles to understand differences and similarities between groups, to identify target groups or vulnerable groups, to personalize services or products, or to predict behaviors or opinions.

4.2 Using supervised models with structured data to classify and predict social variables

The problem we want to address is to classify and predict the social variables that influence or depend on the behavior of individuals or social groups, based on independent or explanatory variables. This allows us to better understand the associations between social variables, which may or may not be causal, and to anticipate the effects or consequences of certain actions or situations.

The method we use is supervised learning with structured data, a machine learning technique that creates models by learning from a set of labeled data, that is, data in which the dependent or target variable is known. Supervised models are based on algorithms that estimate the function that best approximates the relationship between the independent variables and the dependent variable. Some of the most used algorithms are linear regression, logistic regression, decision trees, random forests and support vector machines.
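To show what "learning from labeled data" means in practice, here is a small sketch of one of these algorithms, logistic regression, fitted by plain gradient descent. The data set (a single scaled feature and a binary label), the learning rate and the function names are all invented for illustration.

```python
import math

# Hypothetical labeled data (invented values): one explanatory variable
# (years of education divided by 10) and a binary target (voted: 1/0).
X = [[0.8], [1.0], [1.2], [1.4], [1.6], [1.8], [2.0], [2.2]]
y = [0, 0, 0, 1, 0, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, row):
    """Estimated probability that the target equals 1 for one row."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, row)))


def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, target in zip(X, y):
            error = predict_proba(weights, bias, row) - target
            for j, value in enumerate(row):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


weights, bias = fit_logistic(X, y)

# Predict for two new, unseen (hypothetical) individuals.
p_low = predict_proba(weights, bias, [0.9])
p_high = predict_proba(weights, bias, [2.1])
print(f"weights={weights} bias={bias:.2f}")
print(f"P(voted | 0.9)={p_low:.2f}  P(voted | 2.1)={p_high:.2f}")
```

The fitted weight tells us the direction of the association in this toy data; it does not, on its own, establish a causal effect.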

The result we obtain is a classification or prediction of social variables, that is, an assignment or estimate of the value of the dependent or target variable for each element of the data set. We can evaluate the quality of the models using performance metrics, such as accuracy, precision, recall and F1-score for classification, or the coefficient of determination for regression.
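These metrics are simple to compute by hand, which helps to see what each one measures. The sketch below uses invented true and predicted values; the function names are our own.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def r_squared(y_true, y_pred):
    """Coefficient of determination for a regression model."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical classification results (invented labels).
class_true = [1, 0, 1, 1, 0, 0, 1, 0]
class_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(class_true, class_pred)

# Hypothetical regression results (invented values).
reg_true = [2.0, 4.0, 6.0, 8.0]
reg_pred = [2.5, 3.5, 6.5, 7.5]
r2 = r_squared(reg_true, reg_pred)

print(metrics)
print(f"R^2 = {r2:.2f}")
```

Accuracy alone can be misleading when one class is rare, which is why precision and recall are usually reported alongside it.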

The resulting benefit is an analysis and anticipation of social variables, i.e. an understanding and projection of social phenomena in terms of quantifiable and measurable variables. We can use these models to test hypotheses, to estimate impacts, to make recommendations, or to intervene on social phenomena.

4.3 Using NLP models with unstructured data to analyze and interpret social texts

The problem we want to address is to analyze and interpret social texts that express the opinions, feelings, emotions, intentions, requests or information of individuals or social groups, based on the natural language used. This allows us to better understand the meaning and value of social texts, and to extract relevant or useful information for different purposes.

The method we use is NLP (natural language processing) applied to unstructured data, that is, data that does not have a predefined or standardized form. NLP models can process and, to some extent, interpret natural language by computing numerical representations of social texts that capture lexical, syntactic or semantic information. Some of the most used representations and models are bag of words, n-grams, TF-IDF, word embeddings and BERT.
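Two of the simplest representations, bag of words and TF-IDF, can be sketched in a few lines. The posts are invented, the tokenizer is a naive whitespace split, and the weighting is the textbook form tf × log(N / df); libraries typically use smoothed variants, so their numbers would differ.

```python
import math
from collections import Counter

# Hypothetical short posts (invented text).
posts = [
    "the new bus line is great",
    "the bus was late again",
    "great concert in the park",
]


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


# Bag of words: each text becomes a table of word counts.
bags = [Counter(tokenize(post)) for post in posts]

# Document frequency: in how many posts does each word appear?
doc_freq = Counter(term for bag in bags for term in bag)


def tf_idf(bag):
    """Weight each word by its frequency in the post (tf) and its
    rarity across posts (idf = log(N / document frequency))."""
    total = sum(bag.values())
    n_docs = len(bags)
    return {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in bag.items()
    }


weights = [tf_idf(bag) for bag in bags]

for post, w in zip(posts, weights):
    top = sorted(w, key=w.get, reverse=True)[:3]
    print(f"{post!r}: most distinctive words = {top}")
```

Words that occur in every post, such as "the", get a weight of zero, while words specific to one post stand out. Embeddings and models such as BERT go further by encoding meaning and context rather than counts.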

The result we obtain is an analysis or interpretation of social texts, i.e. an extraction or generation of relevant or interesting information from social texts. We can assess the quality of models using evaluation metrics, such as consistency, relevance, completeness, or creativity.

The benefit that we derive is an understanding and appreciation of social texts, that is, the ability to know and make use of their contents and expressions. We can use these models to classify the opinion or sentiment of social texts, to extract key entities or concepts, to generate summaries or paraphrases, or to answer questions about them.

Conclusion

In this article, we have seen how data science and sociology can collaborate to use data to explore and model social phenomena. We have seen that sociology is a social science that studies social phenomena, and that provides valuable and indispensable social data for data science. We have seen that machine learning is a branch of artificial intelligence that creates models that learn from data, and that offers powerful and versatile tools to simulate social phenomena. Finally, we have seen some practical examples of how machine learning can be applied to sociology, to solve real problems and create social value.
