The latest version of Numbers allows you to use regular expressions, also called regex, in formulas. Regex is very powerful if you can master it. You can use them to match patterns in text to perform otherwise difficult or impossible things in your spreadsheets.
You can also watch this video at YouTube.
Watch more videos about related subjects: Numbers (200 videos).
You can also watch this video at YouTube.
Watch more videos about related subjects: Numbers (200 videos).
Video Transcript
Hi this is Gary with MacMost.com. Today let's look at a powerful new feature in Numbers. The ability to use regular expressions.
MacMost is brought to you thanks to a great group of more than 700 supporters. Go to MacMost.com/patreon. There you could read more about the Patreon Campaign. Join us and get exclusive content and course discounts.
So Regular Expressions is something you could find in all sorts of programming languages and other places. It's the ability to look for patterns in text, count them and even replace text using those patterns. Numbers 10.1 adds the ability to use these in a variety of different places. This is something that Microsoft Excel doesn't even have. So it's really exciting to see it in Numbers. So let's look at some examples.
Here I've got a list of names here in a Table and I'm going to use Regular Expressions to count various things. For instance here I'm just going to count the number of names that have the letter A in them. Now the formula is here in Column B. So you can see here's the formula. It's using COUNTIF, it's looking here in this list of names, and then as the search criteria it's going to use the new function REGEX. REGEX is usually the abbreviation for regular expressions. So you see that everywhere.
Now a function in Numbers can't be used by itself. It has to be used inside of other functions like COUNTIF. In this case the pattern I'm looking for is simply the letter A. A second parameter is whether to have it be case sensitive or ignore case. In this case we're going to ignore Ccve the letter A somewhere in them. Next we've got the same function but we're doing it case sensitive. So the pattern here is looking for capital A. We only have five names that have a capital A in them. Note that the formulas is here but we can see it here because I'm using another new function in Numbers Formula Text. That just allows me to show you the formula here in a different cell. It's very useful for tutorials like this.
The third example gets into the patterns a little more. So here we can see the caret symbol and the dollar sign. Think of those as bookends for the pattern. When you include those it means that this pattern has to match the entire piece of text. It can't be part of it. Like here we have lower case a. If we put the caret symbol and the dollar sign around that the only matches we would get is that the person's name was just the letter a instead of an a appearing anywhere inside. Here it means this pattern has to be a complete piece of text. The pattern we're looking for is backslash S which stands for a non space character and then backslash lower case s which stands for a space character followed by backslash capital S a non space character.
The Plus symbols here after the capital S in each case means it could be one or more of those. So one or more letters followed by a space followed by one or more letters. That's the pattern we're looking for. That appears 10 times. It's not 12 because two of the names here have middle names. They don't match the pattern word space word.
In this next one here we're doing something very similar except we put the capital letter A before the first backslash s. So the pattern now is some word beginning with the letter A and then having one or more characters after it. If you want it to be a little more complete we can change that to an asterisk. A Plus symbol means one or more. An asterisk mean zero or more. So now we're looking for capital A followed by zero or more characters followed by a space followed by one or more characters after that. That pattern is matched three times.
But what about the last name. We could do the same thing here. We're just putting the capital letter B before the backslash S that represents the last name. We also may want to change the Plus symbol here to an asterisk to say that the last name can actually just be the letter B or it could be B followed by any number of characters. We could see there are four names like that.
Now we look for first name starts with A last name starts with B. So we're just doing both of those things. We could use the Plus symbol if we want to make sure there are at least two characters there or an asterisk if we want to allow just the letter A or just the letter B as the last or first name.
So we have some more examples here. Here we're using the brackets, bracket AB, to represent that we could use any of the characters in that set. So instead of just the letter A and just the letter B we'll allow either one of those. So either capital A or capital B followed by one or more or maybe zero of more letters. Then a space and then one or more letters. So that matches 4 times.
Now here's one where we're putting parentheses around the pattern. So it's looking for some number of letters, one or more, followed by a space and then it's putting parentheses around that so it can match that pattern one or more times. So in other words, first name or maybe first name and middle name. Then it's going to look for the letter J followed by one or more letters after that. We can make it zero or more with an asterisk. So that happens 1 time. It's right here.
As a final example here we've got a last name that has to start with some character but the second character needs to be either aeio or u. Then zero of more characters after that. In other words any name where the last name has a vowel as the second letter. So now let's look at the Count Matches function. This is a new function that gives Regular Expressions to find the number of times the pattern appears in a piece of text. So the piece of text we're going to put here in column B. Then the formula is going to be here in C. Then I'll show you the formula here in D. In this case we're looking for the Regular Expression th, just two letters, and we're seeing how many times it appears in the string. COUNTMATCHES for B1, the Regular Expression th, and False. We're not caring about case. We can see it appears 3 times. We have This, that, and other. The th in other is the 3rd time and the first time th matches even though it's a capital T because we said false for being case sensitive.
In this second example here we're going to look for the word and. But we don't want it to match the letter a n d inside another word. So we're going to use backslash lower case b to define a word boundary. So the pattern is word boundary and word boundary. So it would only match a n d if it's a single word by itself. It will work if it's the first word or the last word as well.
Now we can use patterns in an interesting way to look for words that are similar to each other. So in this case we're looking for words so we do the backslash b on either side of this pattern and we're looking for words that are th and then the letters i s or a t. The pipe symbol there found on the right side of the keyboard is an or symbol. So th is or th at. So we're going to count the number of times this or that appears.
Here's a much more complex one. We're going to use word boundaries again and we're going to look for backslash w which is a word character. So in other words not including punctuation and numbers and things. Then we're going to look for one or more of those characters. So a word of one or more characters but we're going to use this question mark equals dot asterisk and backslash 1 to basically do what's called a look at. So we find a word and then this pattern says okay now that you've found a word look ahead and see if you find an exact match for that word later on. If you do, count it. So in this case it finds that twice. It finds the word one and looks ahead and finds the word one again and then looks for two, doesn't find that, looks for three and finds that again, that's two now. There are two duplicate words in this piece of text.
Now you can also use this on numbers. Regular Expressions work on strings or pieces of text. But if you use a number it will just translate it into a string. So the number 99.2 is translated into 9 9 period 2. We could use Regular Expressions to search that piece of text. In this case we're going to look for a whole number. So the pattern here is using the caret and dollar sign to say it's got to be the entire thing. We do backslash lower case d which looks for digits 0-9 and one or more of those. This matches that pattern. It will give us a 1 for one match. This doesn't match the pattern because there's a period in here which is not a digit. However here we are looking for a decimal number. We're looking for zero or more digits followed by a period. We have to use backslash period since period is a special character in Regular Expressions. So backslash period says we really do want to look for a period. Then backslash d one or more times. So in this case 99.2 falls into that pattern. We find one match. So this is an easy way to check for decimal numbers.
Now it's more common to use Regular Expressions to look for things like valid email addresses. So in this case we've got a pretty complex one here. It's looking for letters A to Z or 0 to 9 or underscore or a period or a plus or a minus. It has got to be those characters and then one or more of those followed by an @ symbol and then a similar set of characters one or more of those followed by a period and then just some letters or periods. So this will more or less match an email address. It's not perfect. I've seen longer ones that will take into account some more variations. But in general this will give us a 1 if this is an email address and a 0 if it doesn't match.
Here's something simpler to see. In this pattern we're looking for a word boundary and then one of these things which is the first three letters of any month. Also the common abbreviations for a month. If it can find a word in here that begins with one of these three characters then it returns True. So it's a quick way to see if a date is within the first six months of the year. This is a date value here. But when we use Count Matches it's going to interpret it as text to allow us to do the comparison.
Now what happens if we have a date like this. Well we can do the same thing but instead we're going to use the Month Name function for the month from this date. So in other words it's going to translate 7/1/2020 into just the word July. Then it will search for the same thing here. So no matter what format the date is in this one will tell us whether or not it's within the first six months.
Now while you can't use REGEX outside of things like COUNTIF and COUNTMATCHES you can use the function REGEX dot EXTRACT. This will actually look for a pattern and take the result of the Regular Expression and give you that result. As an example here's a pattern that's going to look for a word break, then the letter t and then one or more word characters after that. In other words the first word it finds that starts with the letter t. It's going to return that as the result. In this text here the word two is the result. Now there's a second parameter that you can include if we look here and that's the number here. So one is the default and that will return the first result. Two would return the second result. So using the same function but with the comma two after it will give us the second result. So instead of two being the first word that starts with t, three is the second word and that's the result we get here.
Here's a practical example of that. We're going to look for a word that starts with the pound symbol, in other words a hashtag. We'll look for the first hashtag that's in this string of text. In this case we're going to look for a word that matches one of these. In this case it's one of the twelve months. When we find that it will return that. So you can have a piece of text that contains any month in it, in any format. The first month that is found in the text will be returned here. Now you can concatenate strings and build a Regular Expression based on the values in other cells. So in this case we're going to look for a word using Word Boundaries here but the middle part is going to be taken from this cell. This cell is part of the Regular Expression. It's this the quick fox with a pipe between. In other words match this or that or quick or fox. Whichever one of these words appears first in this text. That's what we'll get as a result. Here's another example of using REGEX Extract. That's to look for the first set of digits and return that. It's a way to pull the number out of some unusual text.
The last thing I want to show you is using the Substitute function. The Substitute function allows you to replace things inside of text. Before you could just use it with plain text. You could look for a certain set of characters. Replace it with another set of characters. But we can use Regular Expression to find patterns and replace those. So in this case we're looking for the pattern of a word followed by a space followed by a word, followed by a space followed by a word. Then we're going to produce a result based on what's found. Every time you put parentheses around something it's called a Capture Group. So you capture the first word, capture the second word, and you capture the third word. Now Substitute when used with REGEX can use something like this. A dollar sign and a number to denote capture groups.
So in this case Capture Group 3 was the third word. We're going to go and say Ah! we want to return the third word first, then the second word, then the first word. You can see that effectively reverses these three words as the result. Here we're doing something a little different. We're looking for word, space, word. Now what will happen here is the first Capture Group will actually match the first word and the second Capture Group will match the second word. But everything has to match. So the first Capture Group is actually going to match all of the words except the last word which will match here. So then if we say let's output the last word then a comma character followed by a space followed by the first match group you get the last word comma and then all the rest of the words.
Here's something that's a little easier to understand I think. You're looking for a pattern where there's a word boundary and then any of these words inside this Capture Group and then a word boundary. Then we're going to replace it with simply a bunch of underscores. So you can use it to say filter the words that appear in some text. Now I'm really just scratching the surface here. You can look up REGEX and REGEX EXTRACT here and get some more examples in the Help in Numbers. You can also see what other functions can use REGEX.
Now if Regular Expressions look difficult it's because they are. They are very difficult. Even experienced programmers have trouble with them. I've been coding for more than 35 years and I rarely can write a Regular Expression just off the top of my head. Usually I have to look in reference material and experiment. There are tons of websites that have beginners guides and such for Regular Expressions. You can also go to regex101.com. At this site you can actually test out Regular Expressions by putting the expression at the top and some sample text at the bottom. I used this to create a lot of these examples. Sometimes it takes a lot of trial and error and reading descriptions you get on the site on the right to figure out exactly the right pattern to use. Regular Expressions really aren't going to be useful to everybody who uses Numbers because they are so difficult to use. But if you search online for terms like REGEX and then what you're actually looking for like REGEX email addresses you'll come up with some patterns that you can then use in your Numbers spreadsheets. I'll include this Numbers document to download at this post at MacMost.com so you can checkout all of these examples.
Here is a ZIP file containing the Numbers document with all of the examples: 2224NumbersRegexExamples.zip.
Any idea as to why Apple didn't put it in Numbers Help. I thought for sure that they would include it in Help. I searched for regex. But I find the help in most Apple apps to be wanting. Thanks for your video.
Gene: It is in the help. It is formula-related, so you'll find it there in the right sidebar when you are typing formulas. Also when you look in the Functions list. Like here: https://support.apple.com/guide/functions/regex-ffac65c37db4/web
Thanks for posting the zip file with the samples. Not being a programmer, I didn't have the skills to do this kind of file manipulation. Being able to see the 'source code' helps me spot typos vs. logic errors. This is very helpful and is just what I need to get started.
I would also be interested in using this capability to search/change text files. I have many years of plain text files and I would like to be able to search and find specific words or strings of text.
Rudy: You can use regular expressions in the Terminal on text files, using a tool called grep. But if you are not a programmer, it is tough to use. At least one text editor app (BBEdit) uses regex in searches though.
A finer point with your "Count duplicate words" example: \w is equivalent to [a-zA-Z_0-9]
i.e. it DOES include numbers and the underscore character.
This is an 'artifact' of what was allowed for unix file names.
There seem to be issues with the REGEX.EXTRACT capture groups argument. For a start, this is in the doc:
If 0 is returned, an array of all capture groups will be returned.
I think they mean "if 0 is given" but I am stumped by exactly what the difference is between indexing using this argument or with INDEX. Less typing?
But more importantly, I can't get 0 to work? No matter what I do the return value from specifying 0 for capture groups is always 0 and not an array. Any ideas?
Philip: Not sure. You'll just have to keep experimenting.
I just discovered an undocumented goodie that could substantially improve functionality and performance with the new REGEX features. REGEX.EXTRACT will take REGEX as an argument type for its second argument in addition to a sting argument—i.e. REGEX_EXTRACT("abccdef",REGEX("^(.).(C+)...$)",1,2) works and returns "CC" as expected. This allows the REGEX to be stored in a cell permitting repeated extracts without the overhead of parsing and compiling the same regular expression.
Thanks for you comments on REGEX.EXTRACT groups and your discovery of the undocumented secret. I could not for the life of me extract a string between two groups. I knew the syntax was correct by using regex testers. After I looked at your last comment, I simply took my REGEX.EXTRACT string and enclosed it inside REGEX. I wish there was more room to show the details.But if REGEX.EXTRACT(Source,Regular expression) does not work, try REGEX.EXTRACT(Source,REGEX(Regular expression)).