Hello, and welcome. In this video, we’re going to show you how to work with regular expressions in the R programming language. To demonstrate the regular expression operations, we’re going to use the simple data frame here. As you can see, it contains a few names and email addresses from different regions. Suppose our goal is to perform data analysis on each of the domains in the email addresses. The problem is, some of the email addresses have regional differences, so the domains may differ, like the ones you see here. So what we need to do is isolate all the characters between the “at sign” and the period. This seems tricky at first, since the domains can vary in length or contain unusual characters, but regular expressions are perfect for a task like this.
Regular expressions are used to match patterns in strings and text. Suppose you needed to express the structure of an email address, like the one here, but in the general case. Basically, an email address can be expressed as a set of characters, followed by an “at sign”, followed by another set of characters. What we just described could be written as this regular expression string. We have the “at sign” in the middle, with text before it and text after it.
So let’s take a closer look at what all these symbols do. The first symbol in the regular expression string is a period, but it has a special meaning. A period is like a wild card that will match any character. The plus sign is also a special character. It matches the preceding pattern element one or more times. So a period followed by a plus sign will match any sequence of one or more characters. The “at sign” simply matches an “at sign” in the string.
Remember that our goal is to isolate the text between the “at sign” and the period, but since the period has a special meaning, we’ll need to alter our expression a bit. Notice in this expression, we’ve added two backslashes before the period character.
This ensures that the period character itself will be matched in a string, rather than the period acting as a wild card. So the expression you see here will match an “at sign”, followed by one or more characters, followed by a period. This is exactly what we need for our problem.
R provides several functions that make use of regular expressions. The first one we’ll look at is “grep”. “grep” takes in at least two inputs: the regular expression, and a vector of strings you’d like to check for a match. Notice that this regular expression contains an asterisk, rather than a plus sign. An asterisk will match zero or more of the previous element, rather than one or more of the previous element. Other than that subtle difference, their behavior is the same. The output shows the positions in the vector of the strings that match this regular expression. You can set the “value” parameter to have the function output the matching strings themselves.
You can also substitute the text found by the regular expression by using the “gsub” function. The second argument serves as the replacement string. In our case, all characters after an “at sign” will be replaced by “newdomain.com”. Notice how the second string was unaffected, since there was no regular expression match.
In order to extract the matched strings, you can use the “regexpr” function, which is like a more detailed “grep”. For each string, it reports where the first match starts and how long it is. We then pass the vector of strings and the match information to the “regmatches” function, which gives us the desired output.
We’re now ready to address our problem. Let’s apply a regular expression to the email column, making sure to isolate everything from the “at sign” to the period. We’ll then extract the matching substrings, and add them to a new column called “domain”. And here is our data frame with the new column. This structure is now well-suited for data analysis.
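As a rough sketch of the steps just described, the code below runs “grep”, “gsub”, “regexpr”, and “regmatches” on a small example. The strings, the “people” data frame, and the variable names are made up for illustration and are not the data shown in the video; the patterns follow the ones described above.

```r
# Hypothetical sample strings; not the data shown in the video.
emails <- c("john@example.com", "not-an-email", "aiko@example.co.jp")

# grep() returns the positions of the strings that match the pattern.
positions <- grep("@.*", emails)

# With value = TRUE, grep() returns the matching strings themselves.
matching <- grep("@.*", emails, value = TRUE)

# gsub() replaces each match; strings without a match are left unchanged.
replaced <- gsub("@.+", "@newdomain.com", emails)

# regexpr() records where the first match in each string starts and how long
# it is (-1 means no match); regmatches() extracts those substrings and
# skips strings that have no match.
found <- regmatches(emails, regexpr("@.+\\.", emails))

# Hypothetical data frame in which every email address contains a match,
# so the extracted vector has one entry per row.
people <- data.frame(
  name  = c("John", "Aiko"),
  email = c("john@example.com", "aiko@example.co.jp"),
  stringsAsFactors = FALSE
)
people$domain <- regmatches(people$email, regexpr("@.+\\.", people$email))
print(people)
```

Two things to keep in mind with this sketch. The period-plus combination is greedy, so for an address with more than one period after the “at sign”, the match runs through the last period, as in “@example.co.” here. Also, “regmatches” drops strings that have no match, so if some rows had no “at sign”, the extracted values would no longer line up with the rows of the data frame.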
We mainly focused on using regular expressions for data extraction, but they’re used in a wide variety of areas, like data cleaning and data mining. They’re also used for text parsing, which plays a part in compiling code, so it’s good to know how to work with regular expressions. By now, you should understand how to use regular expressions to find, replace, and extract patterns in your strings. Thank you for watching this video.