Noisy Data Removal in Machine Learning || Hindi || Lesson 17 || Machine Learning ||
In this video we will discuss noisy data removal in machine learning.

00:40 Noisy data example
01:44 Binning
03:25 Formatting
05:00 Statistical smoothing

Welcome, Wisdomers! In today's video, we're breaking down what noisy data actually means and looking at some simple, practical techniques to clean it up. As you begin your journey into Machine Learning, mastering these basics is key. There are certainly more advanced techniques out there, but to understand those, we first need to build a solid foundation. We'll be diving into those advanced topics in our upcoming classes, so let's start with the basics.

Let's illustrate noisy data using a classic example: radio transmission. In this scenario, the song being broadcast represents the target data (the signal). However, environmental factors, hardware limitations, or poor tuning often introduce background static. This static is the noise. If we were to use this noisy audio as input for a Machine Learning model, perhaps to identify musical notes, the algorithm would attempt to analyze the noise along with the music. Since the noise is random and carries no useful information, teaching the model to 'understand' it is not productive. The result is a model that cannot distinguish between the actual message and the noise. In the same way, datasets are frequently corrupted by human entry errors or system inconsistencies. To ensure a model learns accurately, we must prioritize noise removal during the data pre-processing stage.

Our first technique for reducing noisy data is Binning. To understand how this works, let's look at a practical example from the banking industry. Imagine you are analyzing a dataset where 'Credit Score' is a key feature. Suppose these scores range from 1 to 1000. On a granular level, you might see values like 667, 668, and so on. For a Machine Learning algorithm, trying to identify unique patterns for every single individual value can be incredibly noisy.
The difference between a 667 and a 668 is often statistically insignificant, yet the model might get distracted by these tiny variations. To solve this, we use Binning. Instead of looking at individual numbers, we group them into 'bins' or categories. For example:

601 to 700 becomes 'Fair'
701 to 800 becomes 'Average'
801 to 900 becomes 'Excellent'

By transforming these numbers into categories, we simplify the data. This makes it much easier for the algorithm to identify meaningful patterns rather than getting lost in the 'noise' of individual values. The process is simple: we first sort the data and then distribute the elements into these predefined bins.

While Binning is our go-to technique for numerical data, our next method, Formatting, is essential for cleaning categorical data. Imagine your dataset contains city names. If one entry says 'New York' with a capital 'N' and 'Y', and another entry says 'newyork' in all lowercase, a Machine Learning algorithm will treat them as two completely different cities. This creates unnecessary noise and prevents the model from seeing the true pattern. Similarly, we often find inconsistencies in date formats, phone numbers, and currency. For instance, 'January 1st' versus '01/01' can confuse a system.

To eliminate this noise, we follow a specific set of formatting steps:

Case Normalization: We convert everything to a single case, usually lowercase, so that 'Apple' and 'apple' are identical.
Whitespace Removal: We strip out extra spaces that often creep in during data entry.
Regular Expressions (Regex): We use Regex to strip symbols. For example, we can ensure that '$1500' and '1500' are both treated as the same numerical value.

Finally, we ensure every word and format follows a universal rule. By following these steps, we ensure that the algorithm sees exactly what we see: clean, consistent, and meaningful data.

Our final technique in this introductory video on noise is Statistical Smoothing.
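
Stepping back to Binning for a moment: the credit-score grouping described above can be sketched in a few lines of Python. This is only a minimal illustration, not code from the video. The function name `bin_credit_score` and the sample scores are made up, and because the example only defines bins for 601 to 900, anything outside that range is labelled 'Unbinned' here as an assumption.

```python
from bisect import bisect_left

# Upper edges and labels follow the example ranges from the video:
# 601-700 'Fair', 701-800 'Average', 801-900 'Excellent'.
# Scores outside 601-900 are not covered by the example, so they are
# labelled 'Unbinned' (an assumption made for this sketch).
UPPER_EDGES = [600, 700, 800, 900]
LABELS = ["Unbinned", "Fair", "Average", "Excellent", "Unbinned"]


def bin_credit_score(score: int) -> str:
    """Map a raw credit score to its category label."""
    # bisect_left finds the first upper edge that is >= score,
    # which is the index of the bin the score falls into.
    return LABELS[bisect_left(UPPER_EDGES, score)]


# Hypothetical sample data: sort first, then distribute into bins.
scores = [668, 812, 667, 745, 900, 701]
binned = [(s, bin_credit_score(s)) for s in sorted(scores)]

for score, label in binned:
    print(score, label)
```

With this mapping, 667 and 668 both land in 'Fair', so the model no longer sees them as two different values.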
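
The Formatting steps listed above (case normalization, whitespace removal, and regex cleanup) can be sketched with Python's built-in `re` module. The helper names `normalize_text` and `parse_amount` are hypothetical, chosen only for this illustration, and the sketch assumes amounts are non-negative and use a dot as the decimal separator.

```python
import re


def normalize_text(value: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace to one space."""
    return re.sub(r"\s+", " ", value.strip().lower())


def parse_amount(value: str) -> float:
    """Strip currency symbols and separators, then convert to a number.

    Assumption for this sketch: amounts are non-negative, so a leading
    minus sign would be dropped along with the other symbols.
    """
    cleaned = re.sub(r"[^0-9.]", "", value)
    return float(cleaned)


print(normalize_text("  New   York "))  # extra spaces and capitals removed
print(normalize_text("Apple") == normalize_text("apple"))
print(parse_amount("$1500") == parse_amount("1500"))
```

Note that these steps make 'New York' and ' new   york ' identical, but an entry like 'newyork' with no space at all would still need an extra rule, such as a lookup table of known city names.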
Statistical Smoothing is a powerful tool designed specifically for time-series data, where information is recorded in order. Let's look at an example. Imagine you are tracking daily temperature. As you can see in the video, there is a sudden, extreme spike on one day that immediately drops back to the normal range the following day. In the real world, temperature doesn't fluctuate that drastically in such a short window; this is a classic example of noise, likely caused by a sensor glitch or a transmission error.

To smooth out these spikes, we use a Moving Average. This works by calculating the average over a specific 'window' of time. For example, we can take the average of the present day, the two previous days, and the two following days. By doing this, the sudden spike is mathematically 'smoothed' into the surrounding data, allowing the true trend to emerge clearly.

These three methods, Binning, Formatting, and Smoothing, represent the fundamental pillars of noise removal. Master these, and you'll be ready for the more advanced techniques we will discuss in
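
The Moving Average described above (the present day, the two previous days, and the two following days) can be sketched as a centered five-day window. This is a minimal illustration under stated assumptions: the function name `moving_average`, the `half_window` parameter, and the temperature readings are all made up, and the window is simply shortened at the start and end of the series, which is one of several possible ways to handle the edges.

```python
def moving_average(values, half_window=2):
    """Centered moving average over 2 * half_window + 1 points.

    Near the start and end of the series the window is shortened to the
    points that actually exist (an assumption made for this sketch).
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical daily temperatures with a one-day sensor spike (45).
temps = [20, 21, 20, 45, 21, 20, 22]
smoothed = moving_average(temps)

for raw, avg in zip(temps, smoothed):
    print(raw, round(avg, 1))
```

On this sample the spike of 45 is pulled down to 25.4. The neighbouring days rise slightly, because the spike is spread across the window rather than deleted outright.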