AGU RESEARCH

Columns that reveal the world
- Getting up close and personal with the researchers -

The world we live in is full of problems,
from issues close to home to issues that affect all of humanity.
Our faculty members share their insights
and reveal current realities and truths that are surprisingly little known.

  • Faculty of Economics, Department of Economics
  • The statistical literacy required as statistics and today's data science spread rapidly through modern society
  • Associate Professor Tamae Kawasaki

New talent required as usage scenarios expand

My field of expertise, mathematical statistics and multivariate analysis, is also used in data science and is currently attracting great public interest. Overseas, data scientists have drawn attention for about ten years, to the point of being described as "the sexiest job of the 21st century" (*). In Japan, too, the number of people who want to use data analysis for business strategy and similar purposes appears to have grown rapidly, especially in recent years, and many people from companies now attend statistics-related academic conferences. Behind this trend are probably the increase in business books such as "statistics is the most powerful science" and in introductory books that the general public can easily pick up, as well as the broad applicability of languages for data analysis such as Python and R, which can now be used easily on ordinary PCs.

 

There is no doubt that statistics is a very powerful tool for business and for the public. However, it can be very dangerous unless the data is handled, and the results are interpreted, with the utmost care. Analyzed correctly, data can reveal perspectives that were not visible before; processed or analyzed with an inappropriate method, it can yield incorrect results. Avoiding this requires a correct understanding of each method and the ability to judge whether the data satisfies the assumptions that each method makes. If you handle statistics carelessly, with insufficient understanding, you may stray from the conclusion that should have been drawn and take the wrong path.

This academic field is currently attracting a great deal of attention from an economic perspective, but it also carries the risks described above. For this reason, it is important to develop people who combine an economic perspective with the necessary mathematical knowledge and thinking, and who can perform appropriate analyses. Data scientists who are familiar with both mathematical statistics and economics, and who can link the two at a high level, are undoubtedly the kind of people society will need. As university faculty, we have a responsibility to develop such people while also advancing scholarship. I am currently focusing on research into "missing values" in statistics. Through this research, which is essential to the further development of statistics and data science, I hope to establish new analysis methods and also to nurture many young people.

*Thomas H. Davenport and D. J. Patil, “Data Scientist: The Sexiest Job of the 21st Century,” Harvard Business Review, October 2012

Missing value processing technology has become increasingly important in recent years

Data science also uses a family of statistical methods called "multivariate analysis," which analyzes and interprets multiple variables simultaneously. For example, a health survey collects, in addition to physical condition, items such as age, sex, height, weight, and eyesight. Each item is treated as a variable, but analyzing only "height" without looking at the other variables is unlikely to produce the intended insights into people's health. Multivariate analysis examines the variables jointly and can reveal, for example, what tendencies are found in certain types of people.

However, while it would be ideal if all the data could be collected without problems, some data will be lost, for example when someone forgets to answer a question. Such gaps are called "missing values." Roderick J. A. Little and Donald B. Rubin have proposed three main mechanisms by which missing values occur: MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random).

The first, "MCAR," is a case where, for example, a respondent to a questionnaire survey skips a question by oversight. Here the probability that a value is missing is completely random and depends neither on the variable in question nor on any other variable. The second, "MAR," is a case where missingness is not completely random but depends on other variables and can be explained by them. For example, suppose that in data on income the probability of not answering the income question increases with age, and that the data contains an age variable. Missing income values can then be explained as occurring randomly, conditional on age. The third, "MNAR," is a case where, in the same income example, missingness depends not on age but on income itself: people with higher incomes are more likely to leave the income question unanswered. In other words, the data contains no observed variable that can explain why income is missing.
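The three mechanisms can be made concrete with a small simulation. The following is only a minimal sketch under stated assumptions: the income model, its coefficients, and the logistic non-response probabilities are hypothetical values chosen for illustration and do not come from any real survey or from the research described here. It compares the mean income of respondents with the mean of the full data, both overall and within a narrow age band.

```python
import math
import random
import statistics


def logistic(x):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate(n=50000, seed=0):
    """Hypothetical (age, income) pairs in which income rises with age."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 70)
        income = 200 + 6 * age + rng.gauss(0, 100)  # arbitrary units
        rows.append((age, income))
    return rows


def respondents(rows, p_missing, seed=1):
    """Keep each row with probability 1 - p_missing(age, income)."""
    rng = random.Random(seed)
    return [(a, y) for a, y in rows if rng.random() >= p_missing(a, y)]


def mean_income(rows, lo=20, hi=70):
    """Mean income of the rows whose age lies in [lo, hi]."""
    return statistics.fmean([y for a, y in rows if lo <= a <= hi])


# Assumed non-response probabilities for the income question.
mechanisms = {
    "MCAR": lambda a, y: 0.3,                       # same for everyone
    "MAR": lambda a, y: logistic((a - 45) / 8),     # depends on observed age
    "MNAR": lambda a, y: logistic((y - 470) / 80),  # depends on income itself
}

rows = simulate()
bias_all = {}
bias_band = {}
for name, p_missing in mechanisms.items():
    kept = respondents(rows, p_missing)
    bias_all[name] = mean_income(kept) - mean_income(rows)
    bias_band[name] = mean_income(kept, 44, 46) - mean_income(rows, 44, 46)
    print(f"{name}: overall bias {bias_all[name]:+.1f}, "
          f"bias within ages 44-46 {bias_band[name]:+.1f}")
```

Under these assumptions, the respondents' mean should stay close to the true mean for MCAR. For MAR it should be too low overall but roughly correct once age is held nearly fixed, which is what "explained by age" means in practice. For MNAR it should remain too low even within the age band.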

When such missing values exist, even if they are "MNAR," the data cannot be analyzed, and its value cannot be realized, unless the missing values are dealt with somehow. One option, of course, is to discard every record that contains a missing value, but this wastes the time and money spent collecting the data, and the data cannot simply be collected again and again. How to fill in missing values, or how to analyze data that contains them, is therefore a major issue in modern data science. The graph below shows simulation results for a proposed statistical method for data containing missing values, together with a mathematical modification that further improves its approximation accuracy.

In clinical trials of pharmaceuticals, missing values are inevitable, for example when a participating patient cannot continue the trial because of the effects of their disease or for some other reason, so the handling of missing data is also described in the relevant guidelines. In the future, anyone who handles data, not just those in the pharmaceutical field, will need to know how to handle missing values.
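The trade-off between discarding incomplete records and filling in missing values can be sketched as follows. This is an illustrative example only, not the proposed method discussed in this column: it assumes a hypothetical income model, missingness that depends on age alone (MAR), and simple regression imputation as a stand-in for more refined techniques.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


rng = random.Random(0)
ages, incomes, observed = [], [], []
for _ in range(20000):
    age = rng.uniform(20, 70)
    income = 200 + 6 * age + rng.gauss(0, 100)        # hypothetical model
    p_missing = 1 / (1 + math.exp(-(age - 45) / 8))   # MAR: depends on age only
    ages.append(age)
    incomes.append(income)
    observed.append(rng.random() >= p_missing)

true_mean = statistics.fmean(incomes)

# Option 1: discard every record whose income is missing.
complete = [(a, y) for a, y, o in zip(ages, incomes, observed) if o]
deletion_mean = statistics.fmean([y for _, y in complete])

# Option 2: predict each missing income from age, then average all records.
intercept, slope = fit_line([a for a, _ in complete], [y for _, y in complete])
filled = [y if o else intercept + slope * a
          for a, y, o in zip(ages, incomes, observed)]
imputed_mean = statistics.fmean(filled)

print(f"true mean          {true_mean:.1f}")
print(f"listwise deletion  {deletion_mean:.1f}")
print(f"regression imputed {imputed_mean:.1f}")
```

Under these assumptions, listwise deletion should underestimate mean income because older respondents, who earn more in this model, answer less often, whereas imputing from age should recover a value close to the true mean. Filling in a single predicted value per record understates the uncertainty of the estimate, which is one reason more careful methods are needed in practice.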

 

Simulation results for data with missing values

When the sample size is large enough, the curves approach the red curve. Adding a mathematical modification (pink) to the proposed method (blue) improves the accuracy of the approximation even when the sample size is not large enough.

The Importance of "Statistical Literacy" Built on Mathematics, Economics, and Ethics

Today, as the need for data science grows, the issue of missing values is also becoming more important. I served as a member of the Ministry of Land, Infrastructure, Transport and Tourism's review committee on the processing of statistical surveys, and there I felt that interest in the treatment of missing values has grown even further. Perhaps a major factor is that society has begun to realize the importance and usefulness of statistics. This is of course because the analytical results and interpretations derived using statistics have value, but it is also because people have begun to recognize the importance of engaging seriously with data in order to make use of it.

Statistics is an application of mathematics, so there is no room for emotion or subjectivity. In data science, it is important to maintain this attitude from data collection through to the interpretation of the results. You may work hard to collect data and perform a complex analysis, only to arrive at the conclusion that "nothing can be said." That may simply be because the data you used cannot say anything. If you have the knowledge of statistics and mathematics needed to handle data, you will be able to review your research plan, your sampling procedure, and your analysis. You should not resort to subjective interpretation just because you did not get the results you expected.

As expectations for statistics rise and opportunities to use statistics rapidly increase, I feel that such "statistical literacy" is becoming necessary. In addition to not bringing subjectivity into analysis, if statistics are not approached with awareness of how data, including personal information, is collected and managed, it can develop into legal and ethical issues. It is important not only to have a strong interest in the methods and results of data analysis, but also to have knowledge, sensitivity, and imagination about the various things that lie beyond. Without these, data analysis may become nothing more than a processing task.

Statistics is an academic field, but I believe its ultimate goal is to be used in society. Having focused my research on statistical theory, I now teach at the College of Economics because I am strongly interested not only in the mathematics but also in the role statistics plays in society. The foundations are a theoretical approach grounded in mathematics, an economic approach to society, and statistical literacy based on objectivity and ethics. By acquiring these foundations, you will be able to use statistical data analysis to benefit society.

Handling missing values and multivariate analysis methods are merely "tools." After mathematically analyzing data and interpreting it from a social and economic perspective, how can we use statistical literacy to apply the results of the analysis? I believe that this entire process is "data science." While understanding and pursuing detailed methods, I hope to use statistics to benefit society without losing sight of the bigger picture. (Published in November 2022)

Related books

  • "Data Science as Liberal Arts" by Seiichi Uchida, Yoshinori Kawasaki, Daisuke Kochu, Jun Sakuma, Hiroshi Shiina, Hiroshi Nakagawa, Tomoyuki Higuchi, and Hiroshi Maruyama, edited by Genshiro Kitagawa and Akimichi Takemura (Kodansha: 2021)
  • "Introduction/Exercise Mathematical Statistics" by Kazuo Noda and Etsuo Miyaoka (Kyoritsu Publishing: 1990)
  • "Introduction to Multivariate Analysis" by Hideyuki Douke and Tsunehisa Imada (Tokai University Press: 2001)

Study this topic at Aoyama Gakuin University

College of Economics Department of Economics

