Data Collection

The Data Collection node extracts data in bulk using a structured two-step approach: define rows first, then define columns. It is JTC RPA's core data extraction node.


Prerequisites

Before using this node, it helps to be familiar with the following concepts:

  • Element Selectors: Both row definitions and column extraction rely on selectors to locate elements, including custom pseudo-classes such as :self (which references the current row element itself)
  • CSS Selector Tutorial: A beginner-friendly guide that includes instructions for working in DevTools
  • Variables & Expressions: Column data can reference variables, and output results are passed on via variables
  • Data Transform Pipeline: After extraction, you can apply cleaning operations such as trim, toNumber, and formatDate to each column's data

Overview

The core logic of Data Collection is a two-tier structure:

Rows → Define "where to look" — a set of container elements or an array
Cols → Define "what to extract" — specific fields within each row

Rows: Can be a set of container elements matched by a CSS selector (e.g., product card <div> elements, table rows <tr>), or an existing array variable. Each row corresponds to one final data record.

Columns: Within each row, define the fields to extract. Each column has an independent name and value source.

Extraction results can be output as variables for downstream use, or stored in a database. Multiple nodes can merge into the same table.
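
To make the rows-then-columns idea concrete, here is a minimal sketch in plain JavaScript. The collect function and the sample data are hypothetical and only illustrate the logic; they are not part of JTC RPA.

```javascript
// Hypothetical sketch: `collect` is not a JTC RPA API, only an illustration.
// rows: an array in which each element becomes one output record.
// columns: maps each column name to a function that computes the value from the row.
function collect(rows, columns) {
  return rows.map((item) => {
    const record = {};
    for (const [name, getValue] of Object.entries(columns)) {
      record[name] = getValue(item);
    }
    return record;
  });
}

const rows = [
  { name: "Keyboard", price: 40, quantity: 2 },
  { name: "Mouse", price: 15, quantity: 1 },
];

const records = collect(rows, {
  name: (item) => item.name,
  total: (item) => item.price * item.quantity,
  source: () => "web", // constant column, same for every row
});

console.log(records);
```

Each input row yields exactly one record, and every record has the same set of keys, one per column.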

[Figure: Data Collection configuration panel]

Usage

Step 1: Select Target Object

Decide where the data comes from — "CSS Selector" extracts from the page DOM, "Array List" extracts from an existing variable.

Step 2: Define Rows

  • CSS Selector mode: Provide a container element selector. For example, .product-item for a product list page, or table.order-table tbody tr for a table. The node queries all matching elements at execution time — each match is one row.
  • Array List mode: Use {{variableName}} to reference an existing array. Each element in the array becomes one row.

Step 3: Add Columns

Click "Add Column" to define the fields to extract for each row. Each column includes:

  • Column Name: The field name — the key for this field in the output result
  • Value Source: Determines how the column's value is obtained. Four sources are supported.

Four Column Value Sources

CSS Selector (only available in CSS Selector mode)

Queries child elements within the row element. The column's extraction scope is automatically limited to the current row element — writing .title finds .title inside the row element, not the entire page. This is the most efficient source; bulk DOM queries are completed in a single underlying pass.
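
The row scoping described above can be approximated with the standard DOM methods querySelectorAll and querySelector. This is only a sketch: extractRows is a hypothetical helper, and it queries once per row and column, so it does not reproduce the node's single-pass optimization.

```javascript
// Sketch only: `extractRows` is a hypothetical helper, not part of JTC RPA.
// `root` is anything exposing the standard DOM querySelectorAll method,
// such as `document` in a browser.
function extractRows(root, rowSelector, columnSelectors) {
  const rowElements = Array.from(root.querySelectorAll(rowSelector));
  return rowElements.map((rowEl) => {
    const record = {};
    for (const [name, selector] of Object.entries(columnSelectors)) {
      // Query from the row element, not from the page, so ".title"
      // only matches descendants of this row.
      const cell = rowEl.querySelector(selector);
      record[name] = cell ? cell.textContent.trim() : "";
    }
    return record;
  });
}
```

In a browser console, a call such as extractRows(document, ".product-item", { title: ".title", price: ".price" }) would return one object per product card, with an empty string wherever a column selector matches nothing inside that row.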

Manual Input

A fixed value — the same for every row. Useful for adding constant marker fields, e.g., "source": "web".

Variable Reference

Takes values from other variables. Supports {{$item}} to reference the current row data, and the Data Transform Pipeline for chained cleaning.

{{$item}} → The text content of the current row element itself
{{$item.name}} → The name property of the row object (array mode)
{{userConfig.status}} → Reference other upstream variables
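
As a rough mental model, a placeholder can be read as a dotted path looked up in the variables available to the current row. The resolve function below is a hypothetical sketch of that lookup, not JTC RPA's actual expression engine, and it ignores the Data Transform Pipeline.

```javascript
// Hypothetical resolver: illustrates the lookup only, not JTC RPA's real engine.
function resolve(template, scope) {
  return template.replace(/\{\{\s*([^}]+?)\s*\}\}/g, (match, path) => {
    // Walk the dotted path, e.g. "$item.name" -> scope["$item"]["name"].
    const value = path.split(".").reduce(
      (current, key) => (current == null ? undefined : current[key]),
      scope
    );
    return value === undefined ? "" : String(value);
  });
}

const scope = {
  $item: { name: "Keyboard", price: 40 }, // current row (array mode)
  userConfig: { status: "active" },       // an upstream variable
};

console.log(resolve("{{$item.name}}", scope));        // Keyboard
console.log(resolve("{{userConfig.status}}", scope)); // active
```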

Code Block

Write JavaScript code; the return value becomes the column's value. Access the current row data via $item in the code:

// $item is the current row object
return $item.price * $item.quantity;
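
Values extracted from a page are often strings, so a code block may need to convert them before doing arithmetic. The sketch below assumes the node wraps the code block in a function that receives $item; runCodeColumn is a hypothetical stand-in for that wrapper, used here only so the example can run on its own.

```javascript
// Assumption: the node wraps the code block in a function that receives $item.
// `runCodeColumn` is a hypothetical stand-in for that wrapper.
function runCodeColumn(body, item) {
  const fn = new Function("$item", body);
  return fn(item);
}

// The text of the code block as it might be typed into the column.
const body = `
  const price = Number($item.price);
  const quantity = Number($item.quantity);
  // Return null rather than NaN when a field is missing or not numeric.
  if (Number.isNaN(price) || Number.isNaN(quantity)) return null;
  return price * quantity;
`;

console.log(runCodeColumn(body, { price: "19.5", quantity: 2 })); // 39
console.log(runCodeColumn(body, { price: "N/A", quantity: 2 }));  // null
```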

Step 4: Configure Output

  • Storage disabled: Results are only output as variables, referenced downstream via {{variableName}}
  • Storage enabled: Results are also written to the database. Set "Output Variable", which doubles as the table name. By default, rows are deduplicated on the combination of all fields; use "Unique Index" to deduplicate on selected fields only. Deduplication is based on SHA256 hashing.

Specifying a Unique Index

With storage enabled, "Unique Index" is a multi-select dropdown listing all defined column names. Select one or more fields as the deduplication basis:

  • No fields selected: All column value combinations are SHA256 hashed by default. Completely identical rows are treated as duplicates — only the first is kept.
  • Some fields selected: Only the selected field combinations are hashed for deduplication. For example, selecting order_id means rows with the same order ID are deduplicated, even if other fields (price, status) differ.

Example: five columns are collected (id, name, price, status, time)

No unique index selected → Rows with identical 5-column values are treated as duplicates
id selected → Rows with the same id are treated as duplicates (later ones skipped)
id, status selected → Rows with the same id AND status are treated as duplicates

If you don't want any deduplication at all, add a column that is unique per row (e.g., current timestamp or auto-increment index) and set it as the unique index. This way every row's hash is different, and no data is skipped.
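
The deduplication rules above can be sketched with Node's built-in crypto module. This is an illustration under assumptions: rowKey and dedupe are hypothetical names, and the way the node actually serializes fields before hashing is not documented here.

```javascript
const { createHash } = require("crypto");

// Hypothetical sketch of hash-based deduplication; the node's actual key
// format (field order, separators, encoding) is an assumption.
function rowKey(row, uniqueFields) {
  // With no unique index selected, fall back to every column.
  const fields = uniqueFields.length > 0 ? uniqueFields : Object.keys(row);
  const payload = JSON.stringify(fields.map((f) => [f, row[f]]));
  return createHash("sha256").update(payload).digest("hex");
}

function dedupe(rows, uniqueFields = []) {
  const seen = new Set();
  return rows.filter((row) => {
    const key = rowKey(row, uniqueFields);
    if (seen.has(key)) return false; // a later duplicate is skipped
    seen.add(key);
    return true;
  });
}

const rows = [
  { id: 1, status: "paid", price: 10 },
  { id: 1, status: "paid", price: 12 },
  { id: 1, status: "refunded", price: 12 },
  { id: 1, status: "paid", price: 10 },
];

console.log(dedupe(rows).length);                   // 3
console.log(dedupe(rows, ["id"]).length);           // 1
console.log(dedupe(rows, ["id", "status"]).length); // 2

// Adding a per-row index and using it as the unique index keeps every row.
const indexed = rows.map((row, i) => ({ ...row, row_index: i }));
console.log(dedupe(indexed, ["row_index"]).length); // 4
```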

$item vs. :self

Both are used to reference the "current row" in column rules; the difference is where they're used:

Usage | Applicable Scenario | Meaning
{{$item}} | Variable reference and code block columns | The current row data object
{{$item.field}} | Variable reference column in array mode | A specific property of the row object
:self | CSS selector column | The current row DOM element itself
:self(.class) | CSS selector column | A descendant within the row element with the specified class

$item is a runtime variable automatically injected during column computation; no manual definition is required.


Parameter Reference

Parameter | Type | Default | Description
Target Object | Dropdown | selector | CSS Selector: extract from the page DOM; Array List: extract from an existing array variable
Row Data | Text | | Required. Selector mode: CSS selector; Array mode: {{variableName}}
Column Name | Text | | Field name; the key in the output data
Column Value Source | Dropdown | | CSS Selector / Manual Input / Variable Reference / Code Block
Output Variable | Text | | Variable name that stores the result. When storage is enabled, it also serves as the database table name
Unique Index | Multi-select | All fields | Specifies which fields to use for deduplication. If not set, all column value combinations are used. Based on SHA256 hashing

FAQ

Row selector matches elements but column extraction is empty

Symptom: The row count is correct, but every column's value is empty.

Cause: The column extraction rule matched nothing inside the row elements. A common reason is a column selector written relative to the whole page (for example, one that starts from an ancestor outside the row) rather than relative to the row.

Solution: Column extraction rules are automatically scoped to the row element, so write each column selector relative to the row. In DevTools, select a row element first, then verify that the selector matches inside it.

Some data is skipped and not stored in the database

Symptom: Fewer data records were collected than expected; some rows weren't written to the database.

Cause: By default, rows are deduplicated on the combination of all fields. If two rows have identical values for every field, the later one is skipped. If a unique index is specified, only the specified fields are used for deduplication.

Solution: To deduplicate by specific fields (e.g., only by order_id), select those fields in "Unique Index." To keep deduplication from dropping any rows, add a column that is unique per row (e.g., an auto-increment index or the current timestamp) and select it as the unique index.

Collected text has messy formatting

Symptom: Extracted text has leading/trailing spaces or line breaks, or numbers come out as strings.

Cause: Raw text in HTML retains formatting characters.

Solution: Use the Data Transform Pipeline to clean column data — add trim to remove whitespace, toNumber to convert to numbers, stripHTML to remove tags, etc.
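
The effect of chaining these steps can be sketched in plain JavaScript. The functions below are hypothetical stand-ins that reuse the step names from this page; the built-in pipeline operations may handle edge cases differently.

```javascript
// Hypothetical stand-ins for the pipeline steps named above; the built-in
// operations may handle edge cases differently.
const steps = {
  trim: (value) => String(value).trim(),
  stripHTML: (value) => String(value).replace(/<[^>]*>/g, ""),
  // Drop everything except digits, sign, and decimal point before converting.
  toNumber: (value) => Number(String(value).replace(/[^0-9.-]/g, "")),
};

// Apply the named steps left to right, feeding each result into the next.
function pipeline(value, stepNames) {
  return stepNames.reduce((current, name) => steps[name](current), value);
}

console.log(pipeline("  <b>Wireless Mouse</b>\n", ["stripHTML", "trim"])); // Wireless Mouse
console.log(pipeline("\n  $1,299.00 ", ["trim", "toNumber"]));             // 1299
```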