Bulk Split Data with the API

When you upload a data file that has more complex data than a CSV can naturally represent, the split operation can be useful.

Note: In this article, rows are referred to as "units".

Any column of your uploaded CSV may be internally delimited by a special character (by default, a space). Calling split on such a column tells CrowdFlower to treat its contents as a collection of discrete items rather than as a single block of text.

Note: You are strongly advised to post complex data to CrowdFlower as JSON rather than CSV, because JSON is better suited to complex data. The split operation is provided as a convenience for existing datasets and is not recommended for new users.

Split

Method: PUT

Endpoint: /jobs/{job_id}/units/split

Parameters:

  • on: A comma-delimited list of columns to be split.
  • with: The internal delimiter for the column. Default is the space character (" ").

Note: "Unit" refers to a row. The term "unit" has been removed from the UI.

Suppose your existing dataset is an arbitrary collection of major authors.

author,major_works,countries_active
Homer,The Iliad|The Odyssey,Greece
Dickens,David Copperfield|Bleak House,England
Nabokov,Camera Obscura|Lolita,Russia|United States
Rabelais,Gargantua and Pantagruel,France
Cervantes,Don Quixote,Spain

When this data is posted to CrowdFlower as a CSV, one row is created for each of the five rows of data, and each row has data for the three CSV columns provided. When the data is first posted, CrowdFlower treats every value as free text with no depth or structure. After the initial post, for example, Dickens' major_works field is set to the single string David Copperfield|Bleak House.

To let CrowdFlower know that the major_works and countries_active columns are each actually collections of delimited values, you can use the split operation.

curl -X PUT --data-urlencode "key={api_key}" "https://api.crowdflower.com/v1/jobs/{job_id}/units/split?on=major_works,countries_active&with=%7C"

Note: Be careful to URL-encode the parameters. For example, the | delimiter is encoded as %7C.
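As a rough illustration of the encoding step, the sketch below builds the split URL in Python using only the standard library. The helper name build_split_url and the job ID 12345 are hypothetical, and the base URL is simply copied from the curl example above. No request is sent.

```python
from urllib.parse import quote, urlencode

# Base URL copied from the curl example; adjust if your account uses a different host.
API_BASE = "https://api.crowdflower.com/v1"


def build_split_url(job_id, columns, delimiter=" "):
    """Return the split endpoint URL with URL-encoded query parameters.

    build_split_url is a hypothetical helper, not part of any CrowdFlower client.
    """
    # safe="," keeps the comma between column names literal, as in the curl example.
    # quote_via=quote encodes a space as %20 and the pipe character as %7C.
    query = urlencode(
        {"on": ",".join(columns), "with": delimiter},
        safe=",",
        quote_via=quote,
    )
    return f"{API_BASE}/jobs/{job_id}/units/split?{query}"


# "12345" is a placeholder job ID used only for illustration.
url = build_split_url("12345", ["major_works", "countries_active"], "|")
print(url)
```

Building the query string with urlencode avoids hand-encoding mistakes, and passing the resulting URL to an HTTP client as a single string sidesteps the shell's special handling of & and |.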

After the PUT, CrowdFlower will consider Dickens' major_works field to be set to the collection [ "David Copperfield", "Bleak House" ]. Similarly, Nabokov's countries_active field will be set to [ "Russia", "United States" ]. The brackets indicate a data structure that is analogous to a List or Vector in Java, a list in Python, an Array in Ruby, etc. If you were to request Homer's major_works from CrowdFlower, it would be returned as a JSON array:

{"major_works": ["The Iliad", "The Odyssey"]}

Because the author field was not split, it will not be treated as a collection:

{"author": "Homer"}
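To preview the effect of a split before calling the API, the transformation can be approximated locally. The sketch below is only a simulation in Python, not CrowdFlower's implementation: split_columns is a hypothetical helper, and it assumes that a value with no delimiter becomes a one-item list, which this article does not confirm.

```python
import csv
import io
import json

# The sample dataset from this article.
SAMPLE_CSV = """author,major_works,countries_active
Homer,The Iliad|The Odyssey,Greece
Dickens,David Copperfield|Bleak House,England
Nabokov,Camera Obscura|Lolita,Russia|United States
Rabelais,Gargantua and Pantagruel,France
Cervantes,Don Quixote,Spain
"""


def split_columns(rows, on, delimiter=" "):
    """Return copies of rows in which each column named in `on` is a list.

    Hypothetical local stand-in for the split operation. Assumption: a value
    that does not contain the delimiter becomes a one-item list.
    """
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the original flat rows stay unchanged
        for column in on:
            new_row[column] = new_row[column].split(delimiter)
        result.append(new_row)
    return result


# Parse the CSV into flat rows, where every value is a plain string.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Split the two delimited columns on the pipe character.
split_rows = split_columns(rows, on=["major_works", "countries_active"], delimiter="|")

# Show the Dickens row after the simulated split.
print(json.dumps(split_rows[1], indent=2))
```

Running this on an existing file is a quick way to check that the delimiter and column names are right before sending the real request.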
