This vignette describes how to use the dai package to control the Driverless AI platform from R. It covers the main predictive data-science workflow:

  1. Loading the data
  2. Automated feature engineering and model tuning
  3. Model inspection
  4. Predicting on new data
  5. Managing the datasets and models

Loading the data

Before we can start working with the Driverless AI platform, we have to import the package and initialize the connection:

library(dai)
dai.connect(uri = 'http://localhost:12345', username = 'h2oai', password = 'h2oai')

After the connection has been established, you can create a new dataset:

creditcard <- dai.create_dataset('tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv', progress = FALSE)

Any function of the package that displays a progress bar accepts progress = FALSE to switch it off. The progress bars can also be disabled altogether by setting the option dai.progress:

options('dai.progress' = FALSE)
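
The following base-R sketch shows how a per-call argument and a global option of this kind usually interact. It assumes, without confirmation from the package source, that an unset option falls back to TRUE; the helper show_progress is hypothetical and is not part of dai.

```r
# Hypothetical helper: how a function might resolve its progress setting.
# Assumption: an unset 'dai.progress' option falls back to TRUE.
show_progress <- function(progress = getOption('dai.progress', TRUE)) {
  isTRUE(progress)
}

old <- options('dai.progress' = NULL)               # start from an unset option
default_setting <- show_progress()                  # falls back to TRUE
options('dai.progress' = FALSE)
global_setting <- show_progress()                   # option switches progress off
explicit_setting <- show_progress(progress = TRUE)  # per-call argument takes precedence
options(old)                                        # restore the previous value
```

Under these assumptions, an explicit progress argument overrides the option for a single call, while the option sets the default for the rest of the session.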

The function dai.create_dataset loads data located on the machine that hosts Driverless AI. If you wish to upload data located on your workstation, use dai.upload_dataset instead. If you already have the data loaded into an R data.frame, you can convert it into a DAIFrame this way:

iris_dai <- as.DAIFrame(iris)
print(iris_dai)
#> DAIFrame 'd87a0c16-933e-11e9-a824-ac1f6b46eb80': 150 obs. of 5 variables
#> File path: ./tmp/d87a0c16-933e-11e9-a824-ac1f6b46eb80/iris310d26cf1632.csv.1561023374.1821127.bin

Once the dataset has been created, you can display its basic information and summary statistics by calling the generics print and summary:

print(creditcard)
#> DAIFrame 'd72912c6-933e-11e9-a824-ac1f6b46eb80': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv
summary(creditcard)
#>  ID                    LIMIT_BAL              AGE                  
#>  Min.   :            1 Min.   :        10000  Min.   :           21
#>  Mean   :        12000 Mean   :165498.7157798 Mean   :   35.3808492
#>  St.dev.: 6928.0588912 St.dev.:129130.7430653 St.dev.:    9.2710457
#>  Max.   :        23999 Max.   :      1000000  Max.   :           79
#>  Count  :        23999 Count  :        23999  Count  :        23999
#>  Unique :        23999 Unique :           79  Unique :           55
#>  PAY_1                 PAY_2                 PAY_3                
#>  Min.   :           -2 Min.   :           -2 Min.   :           -2
#>  Mean   :   -0.0031251 Mean   :   -0.1234635 Mean   :   -0.1547564
#>  St.dev.:    1.1234487 St.dev.:    1.2005912 St.dev.:     1.204058
#>  Max.   :            8 Max.   :            8 Max.   :            8
#>  Count  :        23999 Count  :        23999 Count  :        23999
#>  Unique :           11 Unique :           11 Unique :           11
#>  PAY_4                 PAY_5                 PAY_6                
#>  Min.   :           -2 Min.   :           -2 Min.   :           -2
#>  Mean   :   -0.2116755 Mean   :   -0.2528855 Mean   :   -0.2780116
#>  St.dev.:    1.1665728 St.dev.:    1.1370067 St.dev.:    1.1581916
#>  Max.   :            8 Max.   :            8 Max.   :            8
#>  Count  :        23999 Count  :        23999 Count  :        23999
#>  Unique :           11 Unique :           10 Unique :           10
#>  BILL_AMT1             BILL_AMT2             BILL_AMT3            
#>  Min.   :      -165580 Min.   :       -69777 Min.   :      -157264
#>  Mean   :50598.9286637 Mean   :48648.0474186 Mean   :46368.9035376
#>  St.dev.:72650.1978093 St.dev.:70365.3956427 St.dev.:68194.7195203
#>  Max.   :       964511 Max.   :       983931 Max.   :      1664089
#>  Count  :        23999 Count  :        23999 Count  :        23999
#>  Unique :        18717 Unique :        18367 Unique :        18131
#>  BILL_AMT4             BILL_AMT5             BILL_AMT6            
#>  Min.   :      -170000 Min.   :       -81334 Min.   :      -339603
#>  Mean   : 42369.872828 Mean   :40002.3330972 Mean   :38565.2666361
#>  St.dev.:63071.4551671 St.dev.:60345.7282797 St.dev.:59156.5011435
#>  Max.   :       891586 Max.   :       927171 Max.   :       961664
#>  Count  :        23999 Count  :        23999 Count  :        23999
#>  Unique :        17719 Unique :        17284 Unique :        16906
#>  PAY_AMT1              PAY_AMT2              PAY_AMT3             
#>  Min.   :            0 Min.   :            0 Min.   :            0
#>  Mean   : 5543.0980458 Mean   :  5815.528522 Mean   :  4969.431393
#>  St.dev.:15068.8627296 St.dev.:20797.4438849 St.dev.:16095.9292948
#>  Max.   :       505000 Max.   :      1684259 Max.   :       896040
#>  Count  :        23999 Count  :        23999 Count  :        23999
#>  Unique :         6918 Unique :         6839 Unique :         6424
#>  PAY_AMT4              PAY_AMT5              PAY_AMT6             
#>  Min.   :            0 Min.   :            0 Min.   :            0
#>  Mean   : 4743.6568607 Mean   : 4783.6436935 Mean   : 5189.5736072
#>  St.dev.: 14883.554872 St.dev.:15270.7039035 St.dev.:17630.7185745
#>  Max.   :       497000 Max.   :       417990 Max.   :       528666
#>  Count  :        23999 Count  :        23999 Count  :        23999
#>  Unique :         6028 Unique :         5984 Unique :         5988
#>  DEFAULT_PAYMENT_NEXT_MONTH SEX            EDUCATION         
#>  Min.   :        FALSE      Count  : 23999 Count  :     23999
#>  Mean   :    0.2237177      Unique :     2 Unique :         4
#>  St.dev.:    0.4167437      Top    :female Top    :university
#>  Max.   :         TRUE      Freq.  :  8921 Freq.  :     11360
#>  Count  :        23999                                       
#>  Unique :            2                                       
#>  MARRIAGE      
#>  Count  : 23999
#>  Unique :     4
#>  Top    :single
#>  Freq.  : 12876
#>                
#> 

Several other generics, such as dim, head, and format, also work as usual on a DAIFrame:

dim(creditcard)
#> [1] 23999    25
head(creditcard)
#>   ID LIMIT_BAL    SEX  EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4
#> 1  1     20000 female university  married  24     2     2    -1    -1
#> 2  2    120000 female university   single  26    -1     2     0     0
#> 3  3     90000 female university   single  34     0     0     0     0
#> 4  4     50000 female university  married  37     0     0     0     0
#> 5  5     50000   male university  married  57    -1     0    -1     0
#> 6  6     50000   male   graduate   single  37     0     0     0     0
#>   PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
#> 1    -2    -2      3913      3102       689         0         0         0
#> 2     0     2      2682      1725      2682      3272      3455      3261
#> 3     0     0     29239     14027     13559     14331     14948     15549
#> 4     0     0     46990     48233     49291     28314     28959     29547
#> 5     0     0      8617      5670     35835     20940     19146     19131
#> 6     0     0     64400     57069     57608     19394     19619     20024
#>   PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
#> 1        0      689        0        0        0        0
#> 2        0     1000     1000     1000        0     2000
#> 3     1518     1500     1000     1000     1000     5000
#> 4     2000     2019     1200     1100     1069     1000
#> 5     2000    36681    10000     9000      689      679
#> 6     2500     1815      657     1000     1000      800
#>   DEFAULT_PAYMENT_NEXT_MONTH
#> 1                       TRUE
#> 2                       TRUE
#> 3                      FALSE
#> 4                      FALSE
#> 5                      FALSE
#> 6                      FALSE

A dataset can be split into e.g. training and test sets directly in R:

splits <- dai.split_dataset(creditcard, 
                            output_name1 = 'train', 
                            output_name2 = 'test', 
                            ratio = .8,
                            seed = 25,
                            progress = FALSE)

In this case, splits is a list with two elements named ‘train’ and ‘test’, where 80% of the data went into train and 20% into test.

splits$train
#> DAIFrame 'd87a0c18-933e-11e9-a824-ac1f6b46eb80': 19199 obs. of 25 variables
#> File path: ./tmp/d87a0c18-933e-11e9-a824-ac1f6b46eb80/train.1561023375.4921598.bin
splits$test
#> DAIFrame 'd87a0c19-933e-11e9-a824-ac1f6b46eb80': 4800 obs. of 25 variables
#> File path: ./tmp/d87a0c19-933e-11e9-a824-ac1f6b46eb80/test.1561023375.5086083.bin

By default it yields a simple random sample, but you can do stratified or time-based splits as well. See the function’s documentation for more details.
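
For intuition, a simple random 80/20 split can be reproduced locally on an ordinary data.frame with base R. This is only an illustration of the idea, using the built-in iris data; it is not the server-side implementation of dai.split_dataset, and the name local_splits is made up for this sketch.

```r
# Local illustration only: an 80/20 simple random split of a data.frame.
# This does not call Driverless AI and is not how dai.split_dataset works internally.
set.seed(25)                                   # fix the seed for reproducibility
n <- nrow(iris)                                # 150 rows
train_idx <- sample(n, size = floor(0.8 * n))  # row indices drawn without replacement
local_splits <- list(train = iris[train_idx, ],
                     test  = iris[-train_idx, ])
sapply(local_splits, nrow)                     # 120 rows in train, 30 in test
```

Every row lands in exactly one of the two parts, so the parts never overlap.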

Automated feature engineering and model tuning

One of the main strengths of Driverless AI is its fully automated feature engineering, combined with hyperparameter tuning, model selection, and ensembling. The function dai.train runs an experiment and returns a DAIModel instance representing the resulting model.

model <- dai.train(training_frame = splits$train,
                   testing_frame = splits$test,
                   target_col = 'DEFAULT_PAYMENT_NEXT_MONTH', 
                   is_classification = TRUE,
                   is_timeseries = FALSE,
                   accuracy = 1, time = 1, interpretability = 10,
                   seed = 25)

Driverless AI can suggest values for accuracy, time, and interpretability. (See dai.suggest_model_params.) If you do not specify values for accuracy, time, or interpretability, then Driverless AI will use the recommended values.

As with DAIFrame, generic methods such as print, format, summary, or predict work with DAIModel:

print(model)
#> Status: Complete
#> Experiment: cuvofetu (da7d857e-933e-11e9-a824-ac1f6b46eb80)
#>   Version: 1.7.0+local_360ed5f-dirty, 2019-06-20 11:36
#>   Settings: 1/1/10, seed=25, GPUs disabled
#>   Train data: train (19199, 25)
#>   Validation data: N/A
#>   Test data: test (4800, 24)
#>   Target column: DEFAULT_PAYMENT_NEXT_MONTH (binary, 22.366% target class)
#> System specs: Linux, 126 GB, 40 CPU cores, 0/0 GPU
#>   Max memory usage: 0.351 GB, 0 GB GPU
#> Recipe: AutoDL (2 iterations, 2 individuals)
#>   Validation scheme: stratified, 1 internal holdout
#>   Feature engineering: 23 features scored (3 selected)
#> Timing:
#>   Data preparation: 7.38 secs
#>   Shift/Leakage detection: 2.05 secs
#>   Model and feature tuning: 3.77 secs (2 models trained)
#>   Feature evolution: 3.14 secs (1 of 3 model trained)
#>   Final pipeline training: 3.90 secs (1 model trained)
#>   Python / MOJO scorer building: 9.93 secs / 0.00 secs
#> Validation score: AUC = 0.749 +/- 0.009 (baseline)
#> Validation score: AUC = 0.749 +/- 0.009 (final pipeline)
#> Test score:       AUC = 0.72849 +/- 0.0099363 (final pipeline)
summary(model)$score
#> [1] 0.7489953
summary(model)$score_f_name
#> [1] "AUC"

Predicting on new data

New data can be scored in two different ways:

  1. You can call predict directly on the model in the R session; or
  2. you can download a scoring pipeline and embed it into your Python or Java workflow.

Predicting in R

The generic predict either returns an R data.frame with the results directly (the default) or, with return_df = FALSE, returns the name of the file on the Driverless AI server that contains the predictions. The latter option may be useful when you predict on a large dataset.

predictions <- predict(model, newdata = splits$test)
head(predictions)
#>   DEFAULT_PAYMENT_NEXT_MONTH.0 DEFAULT_PAYMENT_NEXT_MONTH.1
#> 1                    0.8565767                    0.1434233
#> 2                    0.8565767                    0.1434233
#> 3                    0.8516645                    0.1483355
#> 4                    0.1783960                    0.8216040
#> 5                    0.8565767                    0.1434233
#> 6                    0.8956659                    0.1043341
preds_path <- predict(model, newdata = splits$test, return_df = FALSE)
print(preds_path)
#> [1] "h2oai_experiment_da7d857e-933e-11e9-a824-ac1f6b46eb80/da7d857e-933e-11e9-a824-ac1f6b46eb80_preds_890f8443.csv"

You can later download the file to your workstation:

dai.download_file(file_path = preds_path, dest_path = file.path(tempdir(), 'predictions.csv'), progress = FALSE)
#> [1] "/tmp/RtmpfnKana/predictions.csv"
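
The data.frame returned by predict can be post-processed with ordinary R code. The sketch below turns class probabilities into class labels. It uses a small mock data.frame with the same column names as the output above, so that it runs without a server; the 0.5 cutoff is an arbitrary choice for illustration, not a recommendation from Driverless AI.

```r
# Mock predictions with the same column names as the output shown above;
# with a live connection you would use the data.frame returned by predict().
mock_predictions <- data.frame(
  DEFAULT_PAYMENT_NEXT_MONTH.0 = c(0.8565767, 0.8516645, 0.1783960),
  DEFAULT_PAYMENT_NEXT_MONTH.1 = c(0.1434233, 0.1483355, 0.8216040)
)

threshold <- 0.5  # arbitrary cutoff chosen for illustration
# TRUE where the predicted probability of default exceeds the cutoff
predicted_default <- mock_predictions$DEFAULT_PAYMENT_NEXT_MONTH.1 > threshold
table(predicted_default)
```

In practice, the cutoff would be chosen according to the relative cost of false positives and false negatives.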

Downloading Python or MOJO scoring pipelines

To deploy your model in Python or Java, you can download the full Python or MOJO scoring pipeline, respectively. For more information about how to use the pipelines, please see the documentation.

dai.download_mojo(model, path = tempdir(), force = TRUE)
#> [1] "/tmp/RtmpfnKana/mojo.zip"
dai.download_python_pipeline(model, path = tempdir(), force = TRUE)
#> [1] "/tmp/RtmpfnKana/scorer.zip"

Managing the datasets and models

After some time, you may have multiple datasets and models on your Driverless AI server. The dai package offers a few utility functions to find, reuse, and remove the existing datasets and models.

If you already have the dataset loaded into Driverless AI, you can retrieve the corresponding DAIFrame object with either dai.get_frame (if you know the frame’s key) or dai.find_dataset (if you know the original path or at least a part of it):

dai.get_frame(creditcard$key)
#> DAIFrame 'd72912c6-933e-11e9-a824-ac1f6b46eb80': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv
dai.find_dataset('creditcard')
#> DAIFrame 'd72912c6-933e-11e9-a824-ac1f6b46eb80': 23999 obs. of 25 variables
#> File path: tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv

The latter directly returns the frame if there’s only one match. Otherwise it lets you select which frame to return from all the matching candidates.

Furthermore, you can get a list of datasets or models:

datasets <- dai.list_datasets()
head(datasets)
#>                                    key                     name
#> 1 d87a0c19-933e-11e9-a824-ac1f6b46eb80                     test
#> 2 d87a0c18-933e-11e9-a824-ac1f6b46eb80                    train
#> 3 d87a0c16-933e-11e9-a824-ac1f6b46eb80     iris310d26cf1632.csv
#> 4 d72912c6-933e-11e9-a824-ac1f6b46eb80 creditcard_train_cat.csv
#>                                                                                file_path
#> 1                 ./tmp/d87a0c19-933e-11e9-a824-ac1f6b46eb80/test.1561023375.5086083.bin
#> 2                ./tmp/d87a0c18-933e-11e9-a824-ac1f6b46eb80/train.1561023375.4921598.bin
#> 3 ./tmp/d87a0c16-933e-11e9-a824-ac1f6b46eb80/iris310d26cf1632.csv.1561023374.1821127.bin
#> 4                             tests/smalldata/kaggle/CreditCard/creditcard_train_cat.csv
#>   file_size data_source row_count column_count import_status import_error
#> 1    567584      upload      4800           25             0             
#> 2   2265952      upload     19199           25             0             
#> 3      7064      upload       150            5             0             
#> 4   2832040        file     23999           25             0             
#>   aggregation_status aggregation_error aggregated_frame mapping_frame
#> 1                 -1                                                 
#> 2                 -1                                                 
#> 3                 -1                                                 
#> 4                 -1                                                 
#>   uploaded
#> 1     TRUE
#> 2     TRUE
#> 3     TRUE
#> 4    FALSE
models <- dai.list_models()
head(models)
#>                                    key description
#> 1 da7d857e-933e-11e9-a824-ac1f6b46eb80    cuvofetu
#>                 parameters.dataset.key parameters.dataset.display_name
#> 1 d87a0c18-933e-11e9-a824-ac1f6b46eb80                           train
#>   parameters.resumed_model.key parameters.resumed_model.display_name
#> 1                                                                   
#>        parameters.target_col parameters.weight_col parameters.fold_col
#> 1 DEFAULT_PAYMENT_NEXT_MONTH                    NA                  NA
#>   parameters.orig_time_col parameters.time_col
#> 1                       NA               [OFF]
#>   parameters.is_classification parameters.cols_to_drop
#> 1                         TRUE                      NA
#>   parameters.validset.key parameters.validset.display_name
#> 1                                                         
#>                 parameters.testset.key parameters.testset.display_name
#> 1 d87a0c19-933e-11e9-a824-ac1f6b46eb80                            test
#>   parameters.enable_gpus parameters.seed parameters.accuracy
#> 1                     NA              25                   1
#>   parameters.time parameters.interpretability parameters.score_f_name
#> 1               1                          10                     AUC
#>   parameters.time_groups_columns parameters.time_period_in_seconds
#> 1                           NULL                                NA
#>   parameters.num_prediction_periods parameters.num_gap_periods
#> 1                                NA                         NA
#>   parameters.is_timeseries parameters.config_overrides
#> 1                    FALSE                          NA
#>                                                                                                          log_file_path
#> 1 h2oai_experiment_da7d857e-933e-11e9-a824-ac1f6b46eb80/h2oai_experiment_logs_da7d857e-933e-11e9-a824-ac1f6b46eb80.zip
#>                                                                    pickle_path
#> 1 h2oai_experiment_da7d857e-933e-11e9-a824-ac1f6b46eb80/best_individual.pickle
#>                                                                                                              summary_path
#> 1 h2oai_experiment_da7d857e-933e-11e9-a824-ac1f6b46eb80/h2oai_experiment_summary_da7d857e-933e-11e9-a824-ac1f6b46eb80.zip
#>   train_predictions_path valid_predictions_path
#> 1                                              
#>                                                  test_predictions_path
#> 1 h2oai_experiment_da7d857e-933e-11e9-a824-ac1f6b46eb80/test_preds.csv
#>   progress status training_duration score_f_name     score test_score
#> 1        1      0          30.96114          AUC 0.7489953  0.7284917
#>   deprecated model_file_size diagnostic_keys
#> 1      FALSE       206380953            NULL
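
The listings printed above look like ordinary data.frames, so they can presumably be filtered with standard subsetting. The sketch below assumes that this is the case and works on a hand-built mock containing a few of the columns and values from the dai.list_datasets output above, so it runs without a server.

```r
# Mock of a few columns from the dai.list_datasets() output shown above.
mock_datasets <- data.frame(
  key = c('d87a0c19-933e-11e9-a824-ac1f6b46eb80',
          'd87a0c18-933e-11e9-a824-ac1f6b46eb80',
          'd87a0c16-933e-11e9-a824-ac1f6b46eb80',
          'd72912c6-933e-11e9-a824-ac1f6b46eb80'),
  name = c('test', 'train', 'iris310d26cf1632.csv', 'creditcard_train_cat.csv'),
  data_source = c('upload', 'upload', 'upload', 'file'),
  row_count = c(4800, 19199, 150, 23999),
  stringsAsFactors = FALSE
)

# Keep only datasets with data_source 'upload' and more than 1000 rows
big_uploads <- subset(mock_datasets, data_source == 'upload' & row_count > 1000)
big_uploads$name
```

The key column of such a filtered listing can then be passed on to dai.get_frame.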

Just as dai.get_frame returns a DAIFrame, dai.get_model returns an instance of DAIModel for a given key:

dai.get_model(models$key[1])
#> Status: Complete
#> Experiment: cuvofetu (da7d857e-933e-11e9-a824-ac1f6b46eb80)
#>   Version: 1.7.0+local_360ed5f-dirty, 2019-06-20 11:36
#>   Settings: 1/1/10, seed=25, GPUs disabled
#>   Train data: train (19199, 25)
#>   Validation data: N/A
#>   Test data: test (4800, 24)
#>   Target column: DEFAULT_PAYMENT_NEXT_MONTH (binary, 22.366% target class)
#> System specs: Linux, 126 GB, 40 CPU cores, 0/0 GPU
#>   Max memory usage: 0.351 GB, 0 GB GPU
#> Recipe: AutoDL (2 iterations, 2 individuals)
#>   Validation scheme: stratified, 1 internal holdout
#>   Feature engineering: 23 features scored (3 selected)
#> Timing:
#>   Data preparation: 7.38 secs
#>   Shift/Leakage detection: 2.05 secs
#>   Model and feature tuning: 3.77 secs (2 models trained)
#>   Feature evolution: 3.14 secs (1 of 3 model trained)
#>   Final pipeline training: 3.90 secs (1 model trained)
#>   Python / MOJO scorer building: 9.93 secs / 0.00 secs
#> Validation score: AUC = 0.749 +/- 0.009 (baseline)
#> Validation score: AUC = 0.749 +/- 0.009 (final pipeline)
#> Test score:       AUC = 0.72849 +/- 0.0099363 (final pipeline)

Finally, the datasets and models can be removed by dai.rm:

dai.rm(model, creditcard, splits$train, splits$test, iris_dai)
#> Model da7d857e-933e-11e9-a824-ac1f6b46eb80 removed
#> Dataset d72912c6-933e-11e9-a824-ac1f6b46eb80 removed
#> Dataset d87a0c18-933e-11e9-a824-ac1f6b46eb80 removed
#> Dataset d87a0c19-933e-11e9-a824-ac1f6b46eb80 removed
#> Dataset d87a0c16-933e-11e9-a824-ac1f6b46eb80 removed

By default, the function dai.rm deletes the objects from both the server and the R session. If you wish to remove them only from the server, set from_session = FALSE. Please note that only objects referred to by a plain variable name can be removed from the session. In the example above, splits$train and splits$test are removed from the server but not from the R session, because these expressions are function calls rather than variable names (recall that $ is a function).
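
The difference between a variable name and a call to $ can be checked in base R. The sketch below uses a plain list as a stand-in for splits and does not call dai.rm; how dai.rm itself inspects its arguments is not shown here.

```r
splits <- list(train = 1, test = 2)  # stand-in list; the real elements are DAIFrames

sym_is_name  <- is.name(quote(model))         # TRUE: a bare variable name is a symbol
expr_is_name <- is.name(quote(splits$train))  # FALSE: this expression is not a symbol
expr_is_call <- is.call(quote(splits$train))  # TRUE: it is a call to the `$` function

# To drop the list itself from the session, remove it by name.
rm(splits)
still_exists <- exists('splits', inherits = FALSE)  # FALSE after rm()
```

After a call such as the one above, removing splits with rm therefore clears the remaining references from the session.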