Datasets

In the Datasets tab, you can import files, apply a schema to an imported file, merge two files, or split a file into train and test sets.

<aside> <img src="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/f801b6d7-fd88-4362-abb8-fdc9a9c104dd/JAXON_Logo_Mark_on_blue.jpg" alt="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/f801b6d7-fd88-4362-abb8-fdc9a9c104dd/JAXON_Logo_Mark_on_blue.jpg" width="40px" /> Users must have selected a Project for the Datasets tab to become available.

</aside>

The Datasets tab also contains our SmartSplit feature.

How to Import a Dataset

Before importing a dataset file into Jaxon, open the file externally and note the following:

Does the file contain a header row?
File formatting:
1. Does the file use single or double quotes consistently?
2. Does the label column contain a single label or multiple labels per row? For multiple labels, make sure the field is formatted as shown below:

["label-1", "label-2", "label-n"]
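The label-format check above can be scripted before upload. The following is a minimal sketch, assuming a CSV file with a header row; the function names and the `labels` column name are hypothetical and not part of Jaxon.

```python
import ast
import csv
import io


def classify_label_cell(cell: str) -> str:
    """Classify one label cell as 'single', 'multi', or 'malformed'."""
    text = cell.strip()
    if not text.startswith("["):
        return "single"
    try:
        value = ast.literal_eval(text)  # accepts straight single or double quotes
    except (ValueError, SyntaxError):
        return "malformed"  # e.g. curly quotes or unquoted labels
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        return "multi"
    return "malformed"


def summarize_labels(csv_text: str, label_column: str) -> dict:
    """Count how many rows hold single, multi, or malformed labels."""
    counts = {"single": 0, "multi": 0, "malformed": 0}
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[classify_label_cell(row[label_column])] += 1
    return counts


# Hypothetical sample: one multi-label row, one single-label row, one broken row.
sample = (
    "text,labels\n"
    '"hello","[""label-1"", ""label-2""]"\n'
    '"bye","label-3"\n'
    '"oops","[label-1, label-2"\n'
)
summary = summarize_labels(sample, "labels")
```

A non-zero `malformed` count suggests fixing the file before import. Typographic (curly) quotes in a label list are a common cause, since they are not valid list syntax.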

To import a dataset, select the Datasets tab and click + from the Dataset Menu

Fill out the intake form and select Submit

Dataset Intake Form Field Descriptions

<aside> <img src="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/464d32fd-5052-4ad8-a6f3-915251eaf8f4/JAXON_Logo_Mark_on_blue.jpg" alt="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/464d32fd-5052-4ad8-a6f3-915251eaf8f4/JAXON_Logo_Mark_on_blue.jpg" width="40px" /> Currently, Jaxon supports datasets in CSV, TSV, JSON, XML, and XLS/XLSX formats, either zipped or unzipped, as long as all files within a folder are of the same type.

</aside>
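The same-type requirement can be checked locally before uploading a folder or archive. This is a rough sketch, not Jaxon's own validation: the function name is hypothetical, and the extension set assumes the spreadsheet formats are the Excel extensions `.xls` and `.xlsx`.

```python
from pathlib import PurePath

# Assumed mapping of the supported formats to file extensions.
SUPPORTED_EXTENSIONS = {".csv", ".tsv", ".json", ".xml", ".xls", ".xlsx"}


def common_extension(file_names):
    """Return the single shared, supported extension, or None if the files are mixed."""
    extensions = {
        PurePath(name).suffix.lower()
        for name in file_names
        if not name.endswith("/")  # skip directory entries, as listed in zip archives
    }
    if len(extensions) != 1:
        return None
    (extension,) = extensions
    return extension if extension in SUPPORTED_EXTENSIONS else None


uniform = common_extension(["data/part1.csv", "data/part2.CSV"])
mixed = common_extension(["data/part1.csv", "data/part2.json"])
```

For a zipped dataset, the list of names could come from `zipfile.ZipFile(path).namelist()` in the standard library.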

Specify the formatting characteristics of the dataset file in the box to the left of the dataset preview and click Import

Dataset Import Field Descriptions

<aside> <img src="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/52b51b4e-6d42-40df-8e55-1ccfaa283217/JAXON_Logo_Mark_on_blue.jpg" alt="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/52b51b4e-6d42-40df-8e55-1ccfaa283217/JAXON_Logo_Mark_on_blue.jpg" width="40px" /> Jaxon will attempt to automatically identify these characteristics, but for best results, we recommend verifying the information.

</aside>

Specify which columns in the dataset Jaxon will use as Features (Free-Form, Numerical, Categorical) and/or Labels. At least one Features column must be specified before the dataset can be used.
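It can help to write down the intended column roles before filling in this step. The sketch below is purely illustrative: the role names mirror the categories above, but the dictionary layout and the `validate_column_roles` function are hypothetical and do not correspond to a Jaxon API.

```python
FEATURE_ROLES = {"free-form", "numerical", "categorical"}
LABEL_ROLE = "label"


def validate_column_roles(columns, roles):
    """Check a column-to-role plan and return (feature_columns, label_columns)."""
    missing = [name for name in roles if name not in columns]
    if missing:
        raise ValueError(f"columns not found in the file: {missing}")
    unknown = [role for role in roles.values() if role not in FEATURE_ROLES | {LABEL_ROLE}]
    if unknown:
        raise ValueError(f"unrecognized roles: {unknown}")
    features = [name for name, role in roles.items() if role in FEATURE_ROLES]
    labels = [name for name, role in roles.items() if role == LABEL_ROLE]
    if not features:
        # Mirrors the rule above: at least one Features column is required.
        raise ValueError("at least one Features column is required")
    return features, labels


# Hypothetical plan for a file with three columns.
features, labels = validate_column_roles(
    ["review", "stars", "sentiment"],
    {"review": "free-form", "stars": "numerical", "sentiment": "label"},
)
```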

Feature and Label Descriptions

<aside> <img src="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/c3ac0905-ae30-4517-9601-8f7205162597/JAXON_Logo_Mark_on_blue.jpg" alt="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/c3ac0905-ae30-4517-9601-8f7205162597/JAXON_Logo_Mark_on_blue.jpg" width="40px" /> The dataset import may take a few minutes to complete. Once the dataset is imported, the dataset will become available in the Datasets tab.

</aside>

Note that for multi-label datasets, each value in the Labels column must be formatted as a Python list.
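If the source file stores multiple labels in some other form, the cells can be converted before import. Here is a minimal sketch, assuming the labels are separated by semicolons; the separator and the function name are assumptions for illustration.

```python
import ast
import json


def to_label_list_cell(raw, separator=";"):
    """Turn 'a; b' into the list-formatted string '["a", "b"]'."""
    labels = [part.strip() for part in raw.split(separator) if part.strip()]
    # json.dumps emits double-quoted strings, which is also valid Python list syntax.
    return json.dumps(labels, ensure_ascii=False)


cell = to_label_list_cell("label-1; label-2")
round_trip = ast.literal_eval(cell)  # parses back into a Python list
```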

From here, you can define the Specification and then use the dataset in the rest of the Jaxon Platform.

How to Copy a Dataset

Once a dataset has been imported, it can be copied. This function creates an exact duplicate of the original dataset.

To view all available datasets within the Datasets tab, select 🔽.

Select a dataset to work with. When a dataset is successfully selected, the list of datasets disappears and only a preview of the selected dataset is shown.

Select the Copy Dataset icon

Fill out the pop-up that appears and select Submit.

Copy Pop-Up Field Descriptions

Once the dataset has been copied, the new dataset will become available in the Datasets tab.

How to Split a Dataset

Any available dataset in the Datasets tab can be split into two smaller sets. The split ratios for labeled and unlabeled rows are controlled independently. A dataset can be split either before or after creating a Specification. If it is split before, both labeled and unlabeled examples are placed in both of the resulting datasets. If it is split after, you can steer unlabeled examples according to a ratio you provide.

In most cases, splitting a dataset creates two new datasets and preserves the original. If the Flatten feature is used, however, only one new dataset is created, and the original is still preserved.

<aside> <img src="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/61489a86-9843-4008-b925-e76f37dfbf58/JAXON_Logo_Mark_on_blue.jpg" alt="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/61489a86-9843-4008-b925-e76f37dfbf58/JAXON_Logo_Mark_on_blue.jpg" width="40px" /> The Train set can contain both labeled and unlabeled examples, but the Test set should not contain any unlabeled examples.

</aside>
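To make the idea of independent ratios concrete, here is a minimal sketch of a random split with separate ratios for labeled and unlabeled rows. It illustrates the concept only and is not Jaxon's implementation; the function name, the `label` key, and the rule that an empty label means "unlabeled" are all assumptions.

```python
import random


def split_rows(rows, label_key, labeled_train_ratio, unlabeled_train_ratio, seed=0):
    """Split rows into (train, test) using one ratio per group."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    labeled = [row for row in rows if row.get(label_key)]
    unlabeled = [row for row in rows if not row.get(label_key)]
    rng.shuffle(labeled)
    rng.shuffle(unlabeled)
    labeled_cut = round(len(labeled) * labeled_train_ratio)
    unlabeled_cut = round(len(unlabeled) * unlabeled_train_ratio)
    train = labeled[:labeled_cut] + unlabeled[:unlabeled_cut]
    test = labeled[labeled_cut:] + unlabeled[unlabeled_cut:]
    return train, test


# Hypothetical data: 10 labeled rows and 5 unlabeled rows.
rows = [{"text": f"labeled {i}", "label": "a"} for i in range(10)]
rows += [{"text": f"unlabeled {i}", "label": ""} for i in range(5)]

# Sending every unlabeled row to Train keeps the Test set fully labeled.
train, test = split_rows(rows, "label", labeled_train_ratio=0.8, unlabeled_train_ratio=1.0)
```

Setting the unlabeled ratio to 1.0 is one way to follow the guidance that the Test set should contain no unlabeled examples.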

Once the dataset has been split and/or flattened, the new dataset(s) will become available in the Datasets tab.

SmartSplit

SmartSplit is a proprietary method for splitting a dataset into training and holdout datasets in a way that avoids covariate drift and other latent differences between them. Specifically, it aims to improve on the standard baseline of random sampling with a predetermined percentage split, such as the typical 80/20 rule of thumb.
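SmartSplit itself is proprietary and is not reproduced here. As a rough illustration of the kind of difference it is meant to avoid, the sketch below measures how far apart the distribution of one categorical column is between a training set and a holdout set, using total variation distance. The function name and sample values are hypothetical.

```python
from collections import Counter


def distribution_gap(train_values, holdout_values):
    """Total variation distance between two categorical samples (0 = identical, 1 = disjoint)."""
    if not train_values or not holdout_values:
        raise ValueError("both samples must be non-empty")
    train_counts = Counter(train_values)
    holdout_counts = Counter(holdout_values)
    categories = set(train_counts) | set(holdout_counts)
    return 0.5 * sum(
        abs(train_counts[c] / len(train_values) - holdout_counts[c] / len(holdout_values))
        for c in categories
    )


# Hypothetical column values after an unlucky random 80/20 split.
train_sources = ["web"] * 8 + ["email"] * 2
holdout_sources = ["web"] * 1 + ["email"] * 1
gap = distribution_gap(train_sources, holdout_sources)
```

A gap near 0 means the two sets look alike on that column. A large gap on a small holdout set is the sort of mismatch that plain random sampling can produce by chance.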

Flatten

How to Merge Datasets

Any two datasets can be merged to provide a combined set. The merge function will merge columns with the same header into a single, combined column. Columns with non-matching names will stay as they are.

<aside> <img src="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/1f078cad-3845-4c56-9805-912d2a4c01e2/JAXON_Logo_Mark_on_blue.jpg" alt="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/1f078cad-3845-4c56-9805-912d2a4c01e2/JAXON_Logo_Mark_on_blue.jpg" width="40px" /> If your data has any labels, make sure the Specification has been assigned and locked before merging datasets for best results.

</aside>
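The column-matching behavior of a merge can be pictured with a small sketch. This is an illustration under stated assumptions, not Jaxon's merge logic: rows are modeled as dictionaries, the function name is hypothetical, and filling cells that exist in only one dataset with `None` is an assumption, since the fill behavior is not described here.

```python
def merge_rows(first, second, fill=None):
    """Stack two row lists; same-named columns combine, the rest are kept as they are."""
    columns = []
    for row in first + second:
        for name in row:
            if name not in columns:
                columns.append(name)  # preserve first-seen column order
    return [{name: row.get(name, fill) for name in columns} for row in first + second]


# Hypothetical datasets sharing only the "text" column.
dataset_a = [{"text": "first example", "label": "positive"}]
dataset_b = [{"text": "second example", "source": "web"}]
merged = merge_rows(dataset_a, dataset_b)
```

In this sketch the shared `text` column becomes a single combined column, while `label` and `source` each remain separate columns in the result.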