Data Science

Data science turns data into knowledge. Many people think that data science is about building complex models, writing code, or parsing large amounts of data, but those are only means to an end. Data science is about making predictions and very educated guesses, and about finding trends and patterns. The complex models, the code, and the large amounts of data come in only because that work requires them. A data scientist’s goal is to extract information from data, and nobody cares what tools they use to do it.

A data scientist writes code to find complicated patterns that a person would be unlikely to spot by hand. Sometimes they identify trends in historical data and try to figure out what will happen next; that information can then be used to predict traffic, weather, sales, and stock prices. Other times, a program compares items against a set of known examples and finds patterns, which can then be used in junk email filters, personal recommendations, malware detection, and recognition of objects and text. Data science can help with all sorts of critical everyday tasks.

Regardless of the goal, the steps a data scientist follows are often very similar. First, they identify a problem or question, for example, “What will the weather be tomorrow?” or “How do I keep junk mail out of my inbox?” Next, they identify trustworthy sources of data, retrieve the data, determine how and where to store it, and keep it up to date.

Then they clean the data, especially if they combined multiple sources. They might make all terms consistent, or at least make a list of synonyms for their program to use later (“blizzard” and “snowstorm”; “webmail” and “email”). They remove duplicate values. They narrow the data down to what is relevant (weather information) and what is needed (for example, they wouldn’t need tsunami data to forecast Kansas’s weather). This is one of the most important steps, because if the data scientist chooses the wrong data to analyze, they will produce incorrect results.

Then they use computer programs, often complex ones involving machine learning, to solve the problem or answer the question. That is the only “data science” of the whole procedure: the other steps are needed, but this is the step that produces results.

Even after the data scientist has extracted the information, there is more to do. They must make it easy to understand and appealing to the eye with dashboards and graphs. When someone checks the weather, they don’t want a lot of confusing text but a dashboard with images and straightforward information. Before sharing it, however, the data scientist needs to run tests to make sure nothing went wrong and that the information is correct. They could wait a few days to verify that their forecast matches the actual weather, or feed the program emails and see if it catches the junk ones.

Even after release, the data scientist must still monitor the program in case anything changes. A function the code depends on could be deprecated and removed, or a data source could shut down. If the program breaks, they need to fix it. Then a new problem crops up that they need to solve, or someone thinks of a new question that they need to answer. A data scientist’s job is never done.
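
The cleaning steps described above (making terms consistent, removing duplicates, and keeping only the needed data) can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the synonym table, the `clean_reports` function, and the sample records are all hypothetical, and real weather data would be far messier.

```python
# Hypothetical synonym table: maps variant terms to one canonical term.
SYNONYMS = {"snowstorm": "blizzard", "webmail": "email"}


def clean_reports(reports, wanted_region):
    """Normalize terms, drop duplicates, and keep only one region's records."""
    seen = set()
    cleaned = []
    for region, event in reports:
        # Make all terms consistent: same case, same word for the same thing.
        event = event.strip().lower()
        event = SYNONYMS.get(event, event)
        # Keep only the needed data (no tsunami reports for a Kansas forecast).
        if region != wanted_region:
            continue
        # Remove duplicate values, which often appear when sources are combined.
        record = (region, event)
        if record in seen:
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned


# Made-up sample records, as if combined from two imaginary sources.
raw = [
    ("Kansas", "Snowstorm"),
    ("Kansas", "blizzard"),
    ("Hawaii", "tsunami"),
    ("Kansas", "rain"),
]
print(clean_reports(raw, "Kansas"))
```

With this sample, the two Kansas storm reports collapse into one “blizzard” record and the Hawaii record is dropped, leaving one blizzard record and one rain record for Kansas.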

2 thoughts on “Data Science”

  • Now that you are getting into data and how data can be turned into information, let me just add that there is a logical trap that many people fall into.
    First, a definition: CORRELATION is a measure of how things relate to each other. For example, you can show that the increase in sales of Bibles and the increase in sales of whiskey are somewhat similar. There is a number (the “correlation coefficient”) that can be calculated statistically to tell how strongly things are correlated. This number varies from +1 to -1. A value of +1 means that the two trends match perfectly; 0 means they show no (linear) relationship at all; and -1 means that they vary inversely in a perfect way: if trend 1 goes up, trend 2 goes down, etc.
    The other term is “causation.” Two trends can be measured and found to have a high (or strongly inverse) correlation coefficient, but does that mean that trend 1 CAUSES trend 2, or vice versa? MAYBE the two trends are caused by ANOTHER trend that is not part of the calculations.
    The watchword in statistical analysis is “CORRELATION DOES NOT MEAN CAUSATION.” In the example above, does buying more Bibles CAUSE people to drink whiskey, or does drinking whiskey CAUSE people to buy more Bibles? (We sincerely hope not).
    MAYBE, just maybe, this high correlation is caused by the increase in population, which makes both of those trends go up in a similar way.
    Repeat after me: CORRELATION is not CAUSATION.
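
The correlation coefficient described in the comment above can be computed directly. The sketch below is a minimal Python illustration of the standard Pearson formula; the `pearson` function name and all of the numbers are made up for the example, with a hypothetical population figure driving both sales trends.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two series move together around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each series varies on its own.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data: a growing population drives BOTH sales figures.
population = [100, 120, 140, 160, 180]
bible_sales = [p * 0.02 for p in population]
whiskey_sales = [p * 0.05 + 1 for p in population]

# Very close to +1, even though neither product causes sales of the other.
print(pearson(bible_sales, whiskey_sales))

# A perfectly inverse pair of trends gives -1.
print(pearson([1, 2, 3], [3, 2, 1]))
```

In this toy data the near-perfect correlation comes entirely from the shared population trend, which is the third-factor trap the comment warns about.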
