Extract text processor
This processor extracts any part of a text, a number, or a combination of the two, and puts it in a new column.
It's similar to the Replace via Regexp processor, except that instead of replacing the content in the original column, it creates a new column containing the extracted text.
Setting the processor
To set the parameters of the Extract text processor, refer to the table below.
Label | Description | Mandatory |
Field | Field containing the values you want to extract. | Yes |
Regular expression | Regular expression to determine which part of the values will be extracted. See https://en.wikipedia.org/wiki/Regular_expression for more details on how to write a regular expression. | Yes |
Example
We'll use the same example as for the Replace via Regexp processor: from a French zip code like 44100, we want to keep only the area code, which is the first two digits, here "44". The Extract text processor creates a new column containing the area code, instead of replacing the original content as the Replace via Regexp processor does.
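The zip code case can be sketched in Python, whose re module accepts the same (?P<NAME>REGEXP) syntax. This is only an illustration: the field name area_code and the exact pattern are assumptions, and the processor's own regular expression engine may behave differently.

```python
import re

# Hypothetical field name "area_code": capture the first two digits
# of a five-digit French zip code.
pattern = re.compile(r"(?P<area_code>[0-9]{2})[0-9]{3}")

match = pattern.match("44100")
# The named group plays the role of the new column.
area_code = match.group("area_code") if match else None
print(area_code)  # 44
```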
In technical language, this processor is used to extract an arbitrary pattern expressed as a regular expression out of a string using sub-matching.
The syntax of the sub-matching expression is (?P<NAME>REGEXP), where:
NAME is the name of the new field that will receive the result of the extraction. This field name can only contain letters, numbers, and underscores (special characters such as accented letters or commas are not allowed).
REGEXP is the submatch expression.
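As a rough sketch of these rules, the Python snippet below lists the NAME parts of an expression and checks them against the naming constraint. The helper names are hypothetical, the constraint is assumed to mean unaccented ASCII letters, and Python's own rules for group names are not necessarily those of the processor.

```python
import re

# Naming rule from the text: letters, numbers and underscores only
# (assumed here to mean unaccented ASCII letters).
VALID_FIELD_NAME = re.compile(r"[A-Za-z0-9_]+")

def is_valid_field_name(name):
    """Return True if name uses only unaccented letters, digits and underscores."""
    return VALID_FIELD_NAME.fullmatch(name) is not None

def new_field_names(expression):
    """List the NAME of every (?P<NAME>REGEXP) group, in order of appearance."""
    compiled = re.compile(expression)
    # groupindex maps each group name to its group number.
    return sorted(compiled.groupindex, key=compiled.groupindex.get)

names = new_field_names(r"(?P<year>[0-9]{4})-(?P<month>[0-9]{2})")
print(names)
print([is_valid_field_name(n) for n in ["street_name", "numéro", "street,name"]])
```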
For example, let's assume that you want to extract a street name out of an address. That is, for the address
600 Pennsylvania Ave NW, Washington, DC 20500, United States
you might want to extract the value Pennsylvania Ave NW into a field named street_name.
You would have to write the following expression:
[0-9]+ (?P<street_name>.*), .*, .*, .*
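To see why the expression spells out every comma-separated part, here is a small Python comparison. It assumes the processor's engine matches greedily like Python's re module, which the source does not confirm.

```python
import re

address = "600 Pennsylvania Ave NW, Washington, DC 20500, United States"

# Expression from the text: three trailing ", .*" parts, one per remaining component.
full = re.match(r"[0-9]+ (?P<street_name>.*), .*, .*, .*", address)

# Shortened variant for comparison: the greedy .* inside the group
# swallows everything up to the last comma.
short = re.match(r"[0-9]+ (?P<street_name>.*), .*", address)

print(full.group("street_name"))   # Pennsylvania Ave NW
print(short.group("street_name"))  # Pennsylvania Ave NW, Washington, DC 20500
```

The shortened variant still matches, but the extracted value runs too far, so the number of ", .*" parts needs to match the number of components that follow the street name.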
And if you also want to extract the street number into a field named street_number, simply extend the previous expression:
(?P<street_number>[0-9]+) (?P<street_name>.*), .*, .*, .*
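The extended expression can be tried out in Python as a quick check. This is a sketch only: each named group stands in for a new column, and the processor itself may differ from Python's re module in details.

```python
import re

address = "600 Pennsylvania Ave NW, Washington, DC 20500, United States"
expression = r"(?P<street_number>[0-9]+) (?P<street_name>.*), .*, .*, .*"

match = re.match(expression, address)
# groupdict() returns one entry per named group, i.e. one per new field.
new_fields = match.groupdict() if match else {}
print(new_fields)
# {'street_number': '600', 'street_name': 'Pennsylvania Ave NW'}
```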