Data science turns data into knowledge. Many people think that data science is about building complex models, writing code, or parsing large amounts of data, but those are means rather than the goal. The goal is to make predictions and well-informed guesses and to find trends and patterns. Complex models, code, and large-scale data processing come in only because that work often requires them. A data scientist’s goal is to extract information from data, and the tools they use matter far less than the result.
A data scientist writes code to find complicated patterns that a person could not realistically detect by hand. Sometimes they identify trends in historical data and try to work out what will happen next. They can then use that information to predict traffic, weather, sales, and stock prices. Other times, a program compares items against a set of known examples and finds patterns. A program can later use those patterns in junk email filters, personal recommendations, malware detection, and recognition of objects and text. Data science can help with all sorts of critical everyday tasks.
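The first idea above, finding a trend in historical data and extending it one step ahead, can be sketched in a few lines. This is only a minimal illustration, assuming the trend is roughly a straight line and the observations are equally spaced; the function names and the sales figures are invented for the example, and real forecasts of weather or stock prices need far richer models.

```python
def fit_trend(values):
    """Least-squares straight line through equally spaced observations.

    Returns (slope, intercept) for y = slope * x + intercept,
    where x is the position of each observation (0, 1, 2, ...).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_next(values):
    """Extend the fitted line one step past the last observation."""
    slope, intercept = fit_trend(values)
    return slope * len(values) + intercept


# Hypothetical monthly sales figures, made up for illustration.
monthly_sales = [100, 110, 120, 130]
print(predict_next(monthly_sales))  # 140.0 for this made-up series
```

A straight line is the simplest possible "trend"; the point is only that the prediction comes from a pattern found in past data.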
Whatever the goal, the steps a data scientist follows are often very similar. First, they identify a problem or question, for example, “What will the weather be tomorrow?” or “How do I keep junk mail out of my inbox?” Next, they identify trustworthy sources of data, retrieve the data, decide how and where to store it, and keep it up to date.
They then clean the data, especially if they combined multiple sources. They might make all terms consistent, or at least make a list of synonyms for their program to use later (“blizzard” and “snowstorm”; “webmail” and “email”). They remove duplicate values. They reduce the data to what is relevant (weather information) and needed (for example, they wouldn’t need tsunami data to forecast Kansas’s weather). This is one of the most important steps, because if the data scientist analyzes the wrong data, they will produce incorrect results.
Then they use computer programs, often complex ones that rely on machine learning, to solve the problem or answer the question. This is the only step of the whole procedure that most people would call “data science.” The other steps are needed, but this is the step that produces results.
Even once the data scientist has an answer, they still have more to do. They must make the results easy to understand and appealing to the eye with dashboards and graphs. Someone checking the weather doesn’t want to see a lot of confusing text but a dashboard with images and straightforward information.
Before sharing the results, however, the data scientist needs to run tests to make sure nothing went wrong and that the information is correct. They could wait a few days to check that their forecasts match the actual weather, or feed the program emails and see whether it catches the junk ones. Even after release, the data scientist must still monitor the program in case anything changes.
A function the program depends on could be deprecated and removed, or a data source could shut down. If the program breaks, they need to fix it. Meanwhile, a new problem crops up that they need to solve, or someone thinks of a new question that they need to answer. A data scientist’s job is never done.
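The cleaning step described earlier (making terms consistent, removing duplicates, and keeping only the needed data) can be sketched as follows. This is a rough illustration rather than a standard recipe: the record layout, the `SYNONYMS` table, and the `clean_records` function are all hypothetical, and the sample reports are invented.

```python
# Hypothetical synonym table: maps each variant term to one canonical term.
SYNONYMS = {"snowstorm": "blizzard", "webmail": "email"}


def clean_records(records, region):
    """Normalize terms, drop duplicates, and keep only one region's records."""
    seen = set()
    cleaned = []
    for rec in records:
        # Make terms consistent: trim, lowercase, then map synonyms.
        event = rec["event"].strip().lower()
        event = SYNONYMS.get(event, event)
        row = (rec["region"].strip(), rec["date"], event)
        if row[0] != region:
            continue  # not needed for this question
        if row in seen:
            continue  # duplicate, e.g. the same report from two sources
        seen.add(row)
        cleaned.append({"region": row[0], "date": row[1], "event": event})
    return cleaned


# Invented sample data combining two imaginary sources.
raw = [
    {"region": "Kansas", "date": "2024-01-08", "event": "Snowstorm"},
    {"region": "Kansas", "date": "2024-01-08", "event": "blizzard"},
    {"region": "Hawaii", "date": "2024-01-08", "event": "tsunami"},
    {"region": "Kansas", "date": "2024-01-09", "event": "clear"},
]

for row in clean_records(raw, "Kansas"):
    print(row)
```

In this sample, the two Kansas reports for the same day collapse into one “blizzard” record once the synonym is normalized, and the Hawaii tsunami report is dropped because it is not needed for a Kansas forecast.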