My field of expertise, mathematical statistics and multivariate analysis, is widely used in data science and is currently an academic field of great social interest. Overseas, data scientists have been attracting attention for about ten years, to the point that the profession has been described as "the sexiest job of the 21st century" (*). In Japan, too, the number of people who want to use data analysis in business strategy and the like appears to have grown rapidly, especially in recent years, judging by how many people from companies now attend statistics-related academic conferences. Behind this are probably the growing number of business books, such as "Statistics Is the Most Powerful Science," and of introductory books that the general public can easily pick up, together with the fact that languages for data analysis with a wide range of applications, such as Python and R, can now be used easily on an ordinary PC.
There is no doubt that statistics is a very powerful tool for business and for the public. However, unless the data is handled and the analysis results are interpreted with the utmost care, it can be very dangerous. When data is analyzed correctly, perspectives that were previously hidden can come into view; when data is not processed and analyzed with an appropriate method, incorrect results can be derived. To avoid this, it is necessary to understand each method correctly, and it is also important to be able to judge whether the data satisfies the assumptions on which each method relies. If you handle statistics carelessly, with insufficient understanding, you may stray from the conclusion that should have been drawn and take the wrong path.
This field is currently attracting a great deal of attention from an economic perspective, but it also involves such risks. For this reason, it is important to develop people who have an economic perspective, the necessary mathematical knowledge and ways of thinking, and the ability to perform appropriate analysis. Data scientists who are familiar with both mathematical statistics and economics, and who can link the two at a high level, are undoubtedly the kind of people that the society of the future will need. As university professors, we have a responsibility to develop such people while also advancing our academic fields. I am currently focusing on research into "missing values" in statistics. Through this research, which is essential for the further development of statistics and data science, I hope to establish new analysis methods and also to nurture many young people.
*Thomas H. Davenport and DJ Patil, "Data Scientist: The Sexiest Job of the 21st Century," Harvard Business Review, October 2012.
Data science also uses a family of statistical methods called "multivariate analysis," which analyze and interpret multiple variables simultaneously. For example, a survey on people's health collects not only physical condition but also various items such as age, sex, height, weight, and eyesight. Each of these items is treated as a variable. If you analyze only "height" without looking at the other variables, it will be difficult to obtain meaningful results about people's health. Multivariate analysis examines the variables jointly, and can reveal, for example, what tendencies are found in certain types of people.
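As a toy illustration of why variables need to be examined jointly, the following Python sketch uses purely synthetic data. The distributions, the coefficients, and the made-up `indicator` are all assumptions chosen for illustration, not survey results. The indicator is constructed to depend on weight relative to height, so height alone says almost nothing about it, while the combination of height and weight does.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible
heights, weights, bmis, indicators = [], [], [], []
for _ in range(5000):
    h = rng.gauss(1.68, 0.09)          # height in metres (synthetic)
    w = 22 * h ** 2 + rng.gauss(0, 8)  # weight in kg (synthetic)
    bmi = w / h ** 2                   # combines two variables into one
    # Hypothetical health indicator: driven by BMI plus noise (an assumption).
    indicator = 0.5 * bmi + rng.gauss(0, 0.5)
    heights.append(h)
    weights.append(w)
    bmis.append(bmi)
    indicators.append(indicator)

r_height = pearson(heights, indicators)  # one variable on its own
r_weight = pearson(weights, indicators)  # another variable on its own
r_bmi = pearson(bmis, indicators)        # height and weight combined

print(f"height alone:     r = {r_height:+.2f}")
print(f"weight alone:     r = {r_weight:+.2f}")
print(f"height and weight: r = {r_bmi:+.2f}")
```

Real multivariate methods (multiple regression, principal component analysis, discriminant analysis, and so on) go well beyond a single derived ratio, but the underlying point is the same: structure that is invisible in one variable can appear once several are considered together.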
Ideally, all of the data would be collected without problems, but in practice some of it is lost, for example when a respondent forgets to answer a question. Such gaps are called "missing values." Roderick J. A. Little and Donald B. Rubin have proposed three main mechanisms by which missing values occur: MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random).
The first, "MCAR," is the case where, for example, a person fails to answer a question in a questionnaire survey through an oversight. Here the probability that a value is missing is completely random: it depends neither on the variable in question nor on any other variable. The second, "MAR," is the case where missingness does not occur completely at random, but depends on other variables and can be explained by them. For example, suppose that in income data the probability of not answering the income question increases with age, and that the data includes age as a variable. Missing income values can then be explained as occurring at random once age is taken into account. The third, "MNAR," is the case where, in the same income example, missingness does not depend on age but on income itself: people with higher incomes are more likely not to answer the income question. In other words, the data contains no variable that can explain why income is missing.
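The three mechanisms can be made concrete with a small simulation. The sketch below is only an illustration under assumed numbers: the age range, the linear income model, and the logistic missingness probabilities are all hypothetical. It generates complete age and income data, deletes income values according to each mechanism, and compares the mean of the remaining incomes with the mean of the full data.

```python
import math
import random
import statistics


def logistic(x):
    return 1 / (1 + math.exp(-x))


def simulate(n=20000, seed=0):
    """Synthetic (age, income) pairs; the model is an assumption for illustration."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 70)
        income = 200 + 6 * age + rng.gauss(0, 80)  # arbitrary units
        rows.append((age, income))
    return rows


def observed_mean(rows, p_missing, seed=1):
    """Mean income among respondents whose income was NOT deleted."""
    rng = random.Random(seed)
    kept = [inc for age, inc in rows if rng.random() >= p_missing(age, inc)]
    return statistics.fmean(kept)


rows = simulate()
true_mean = statistics.fmean(inc for _, inc in rows)

# MCAR: every respondent has the same 30% chance of skipping the question.
mcar = observed_mean(rows, lambda age, inc: 0.3)
# MAR: the chance of skipping rises with age, which IS recorded in the data.
mar = observed_mean(rows, lambda age, inc: logistic((age - 45) / 8))
# MNAR: the chance of skipping rises with the (unobserved) income itself.
mnar = observed_mean(rows, lambda age, inc: logistic((inc - 470) / 60))

print(f"full data: {true_mean:.1f}")
print(f"MCAR:      {mcar:.1f}")
print(f"MAR:       {mar:.1f}")
print(f"MNAR:      {mnar:.1f}")
```

In this setup the MCAR mean stays close to the full-data mean, while the MAR and MNAR means are pulled downward because higher-income respondents are underrepresented among those who answered. The difference between the last two is that under MAR the distortion can be traced to a recorded variable (age), whereas under MNAR nothing in the data explains it.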
When such missing values exist, even if they are "MNAR," the data cannot be analyzed, and its value cannot be extracted, unless the missing values are dealt with in some way. One option, of course, is to discard every record that contains a missing value, but that wastes the time and money spent collecting the data, and the data usually cannot be collected again and again. How to fill in missing values, or how to analyze data that contains them, is therefore a major issue in modern data science. The graph below shows that a statistical method can be applied to data containing missing values, and that adding a mathematical correction to the method further improves the accuracy of the approximation.
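The trade-off between discarding incomplete records and filling in missing values can be sketched as follows. This is a minimal illustration on synthetic data, not the method proposed in the graph: it assumes a MAR setting in which older respondents skip the income question more often, and it uses single regression imputation (predicting income from age) as a stand-in for more refined techniques. All numbers and variable names are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Each record: (age, true income, reported income or None if missing).
records = []
for _ in range(20000):
    age = rng.uniform(20, 70)
    income = 200 + 6 * age + rng.gauss(0, 80)        # assumed income model
    p_missing = 1 / (1 + math.exp(-(age - 45) / 8))  # MAR: depends on age only
    reported = None if rng.random() < p_missing else income
    records.append((age, income, reported))

# Benchmark that would be unknown in a real survey.
true_mean = statistics.fmean(inc for _, inc, _ in records)

# Option 1: discard every record with a missing income (listwise deletion).
complete = [(age, rep) for age, _, rep in records if rep is not None]
deletion_mean = statistics.fmean(rep for _, rep in complete)

# Option 2: fit income = intercept + slope * age on the complete records
# by ordinary least squares, then fill each gap with the predicted value.
mean_age = statistics.fmean(age for age, _ in complete)
mean_inc = deletion_mean
slope = (sum((age - mean_age) * (rep - mean_inc) for age, rep in complete)
         / sum((age - mean_age) ** 2 for age, _ in complete))
intercept = mean_inc - slope * mean_age
filled = [rep if rep is not None else intercept + slope * age
          for age, _, rep in records]
imputation_mean = statistics.fmean(filled)

print(f"full data (unknown in practice): {true_mean:.1f}")
print(f"listwise deletion:               {deletion_mean:.1f}")
print(f"regression imputation:           {imputation_mean:.1f}")
```

Here deletion underestimates mean income because the discarded records are disproportionately those of older, higher-income respondents, while imputation recovers the mean because age explains the missingness and the assumed linear model is correct. Single imputation of this kind understates the uncertainty of the filled-in values, and it would not repair MNAR data, which is part of why the handling of missing values remains an active research topic.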
In clinical trials of pharmaceuticals, the occurrence of missing values is inevitable, such as when a participating patient is unable to continue the trial due to the effects of their disease or for some other reason, so the handling of missing data is also described in the guidelines. In the future, anyone who handles data, not just in the pharmaceutical field, will need to have knowledge of how to handle missing values.
Simulation results for data with missing values
When the sample size is large enough, the approximation approaches the red curve. Adding a mathematical correction (pink) to the proposed method (blue) improves the accuracy of the approximation even when the sample size is not large enough.
Today, as the need for data science expands, the issue of missing values is also becoming more important. I myself served as a member of the Ministry of Land, Infrastructure, Transport and Tourism's review committee on the processing of statistical surveys, and I felt there that interest in the processing of missing values has grown even further. A major factor is perhaps that society has begun to realize the importance and usefulness of statistics. This is of course because the analytical results and interpretations derived from statistics have value, but it is also because people have begun to recognize that data must be engaged with seriously if it is to be put to use.
Statistics is an application of mathematics, so there is no room for emotion or subjectivity. In data science, it is important to maintain this attitude all the way from data collection to the interpretation of the analysis results. You may work hard to collect data and perform a complex analysis, only to arrive at the conclusion that "nothing can be said." That may simply be because the data you used could not say anything. If you have the statistical and mathematical knowledge needed to handle data, you will be able to review your research plan, the way you collect samples, and the way you analyze the data. You should not resort to subjective interpretations just because you did not get the results you expected.
As expectations for statistics rise and opportunities to use statistics rapidly increase, I feel that such "statistical literacy" is becoming necessary. In addition to not bringing subjectivity into analysis, if statistics are not approached with awareness of how data, including personal information, is collected and managed, it can develop into legal and ethical issues. It is important not only to have a strong interest in the methods and results of data analysis, but also to have knowledge, sensitivity, and imagination about the various things that lie beyond. Without these, data analysis may become nothing more than a processing task.
Statistics is an academic field, but I believe its ultimate goal is to be used in society. Having focused my research on statistical theory, I now teach at the College of Economics because I have a strong interest not only in mathematical pursuits, but also in the role of statistics in society. A theoretical approach grounded in mathematics, an economic approach to society, and statistical literacy based on objectivity and ethics: by acquiring these foundations, you will be able to use statistical data analysis to benefit society.
Handling missing values and multivariate analysis methods are merely "tools." After mathematically analyzing data and interpreting it from a social and economic perspective, how can we use statistical literacy to apply the results of the analysis? I believe that this entire process is "data science." While understanding and pursuing detailed methods, I hope to use statistics to benefit society without losing sight of the bigger picture. (Published in November 2022)