Extract text processor

Updated by Anthony Pépin

This processor extracts any part of a value (text, a number, or a combination of the two) and puts it in a new column.

It's similar to the Replace via Regexp processor, except that instead of replacing the content in the same original column, a new column is created with the extracted text.

Setting the processor

To set the parameters of the Extract text processor, use the table below.

Label | Description | Mandatory
Field | Field containing the values you want to extract. | Yes
Regular expression | Regular expression to determine which part of the values will be extracted. See https://en.wikipedia.org/wiki/Regular_expression for more details on how to write a regular expression. It's possible to test regular expressions with an online debugger tool like Regex101. | Yes

Example

We'll use the same example as for the Replace via Regexp processor. From a French zip code such as 44100, we want to keep only the area code, which is the first two digits: "44". The Extract text processor creates a new column containing the area code, whereas the Replace via Regexp processor replaces the content of the original column.
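The zip code example can be tried outside the platform with a short Python sketch. It is only an illustration: the expression, the field name area_code, the helper add_area_code, and the sample rows are assumptions made here, and the processor's exact matching behavior may differ from Python's re.match.

```python
import re

# Hypothetical expression: capture the first two digits in a field named "area_code".
AREA_CODE = re.compile(r"^(?P<area_code>[0-9]{2})")

def add_area_code(row):
    """Return a copy of the row with an extra area_code key when the zip code matches."""
    match = AREA_CODE.match(row["zip_code"])
    new_row = dict(row)  # the original column is left untouched
    if match:
        new_row.update(match.groupdict())
    return new_row

# Sample rows made up for this illustration.
rows = [{"zip_code": "44100"}, {"zip_code": "75011"}]
result = [add_area_code(row) for row in rows]
print(result)
```

Each resulting row keeps its zip_code value and gains an area_code value, which mirrors the "new column" behavior described above.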


In technical terms, this processor extracts an arbitrary pattern, expressed as a regular expression, out of a string using sub-matching.

The syntax of the sub-matching expression is (?P<NAME>REGEXP), where:

  • NAME is the name of the new field that will receive the result of the extraction. This field name can only contain letters, numbers, and underscores (special characters such as accented letters or commas are not allowed).
  • REGEXP is the submatch expression.
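Python's re module happens to use the same (?P<NAME>REGEXP) notation, so it can be used to check which field names an expression declares before pasting it into the processor. This is a rough sketch only: the helper new_field_names and the sample expression are made up for the illustration, and Python's naming rules are not guaranteed to match the processor's exactly.

```python
import re

def new_field_names(expression):
    """List the names declared by (?P<NAME>...) groups, in order of appearance."""
    compiled = re.compile(expression)
    # groupindex maps each group name to its group number.
    return sorted(compiled.groupindex, key=compiled.groupindex.get)

# Hypothetical expression with two named sub-matches.
names = new_field_names(r"(?P<year>[0-9]{4})-(?P<month>[0-9]{2})")
print(names)

# A comma in NAME is rejected by Python when the expression is compiled.
try:
    new_field_names(r"(?P<street,name>.*)")
    comma_rejected = False
except re.error:
    comma_rejected = True
print(comma_rejected)
```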

For example, let's assume that you want to extract a street name out of an address. That is, for the address

600 Pennsylvania Ave NW, Washington, DC 20500, United States

you might want to extract the value Pennsylvania Ave NW in a field street_name.

You would have to write the following expression:

[0-9]+ (?P<street_name>.*), .*, .*, .*

And if you want to extract the street number in a field street_number, simply extend the previous expression:

(?P<street_number>[0-9]+) (?P<street_name>.*), .*, .*, .*
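Both expressions can be checked against the sample address with Python's re module, which supports the same named-group syntax. The sketch below is an illustration rather than the processor's implementation: the helper name extract is hypothetical, and it assumes the expression is matched from the start of the value.

```python
import re

ADDRESS = "600 Pennsylvania Ave NW, Washington, DC 20500, United States"

STREET_NAME = r"[0-9]+ (?P<street_name>.*), .*, .*, .*"
STREET_NUMBER_AND_NAME = r"(?P<street_number>[0-9]+) (?P<street_name>.*), .*, .*, .*"

def extract(expression, value):
    """Return the named sub-matches as a dict, or an empty dict when nothing matches."""
    match = re.match(expression, value)
    return match.groupdict() if match else {}

name_only = extract(STREET_NAME, ADDRESS)
number_and_name = extract(STREET_NUMBER_AND_NAME, ADDRESS)
print(name_only)        # {'street_name': 'Pennsylvania Ave NW'}
print(number_and_name)  # {'street_number': '600', 'street_name': 'Pennsylvania Ave NW'}
```

Note that the expression expects exactly three ", " separators after the street name, so an address with fewer parts yields no extraction in this sketch.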
