Main Content

splitsqlquery

Split SQL query using paging

Description

querybasket = splitsqlquery(conn,sqlquery) splits an SQL query into a basket of multiple SQL queries. By default, each SQL query in the basket returns 100,000 rows in a batch. The resulting number of SQL queries in the basket depends on the size of the original SQL query results.

example

querybasket = splitsqlquery(conn,sqlquery,'SplitSize',splitsize) specifies a custom batch size for the number of rows returned by each SQL query in the basket.

example

Examples

collapse all

Determine the minimum arrival delay using a large set of flight data stored in a database. Access the database in a serial MATLAB® environment.

Using the splitsqlquery function, you can split the original SQL query into multiple SQL page queries. Then, you can access large data in chunks by using the fetch function.

To run this example, you must configure a JDBC data source. For more information, see the configureJDBCDataSource function.

Create a database connection to the JDBC data source MSSQLServerJDBCAuth. This data source configures a JDBC driver to a Microsoft® SQL Server® database with Windows® authentication. Specify a blank user name and password.

datasource = "MSSQLServerJDBCAuth";
username = "";
password = "";
conn = database(datasource,username,password);

Define an SQL query to select all columns from the airlinesmall table, which contains 123,523 rows and 29 columns.

sqlquery = 'SELECT * FROM airlinesmall';

Split the original SQL query into multiple page queries and display them.

querybasket = splitsqlquery(conn,sqlquery)
querybasket = 

  2×1 string array

    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 0 ROWS FETCH NEXT 100000 ROWS ONLY"
    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 100000 ROWS FETCH NEXT 23523 ROWS ONLY"

The query basket contains the page queries in a string array. The splitsqlquery function splits the queries using the default number of rows (100,000).

Define the airlinesdata variable.

airlinesdata = [];

Define the minimum arrival delay minArrDelay variable.

minArrDelay = [];

Execute the SQL page queries in querybasket by using a for loop, and import the data in chunks. Execute SQL page queries in the query basket, and import large data using the fetch function. Find and store the local minimum arrival delay for each chunk.

for i = 1: length(querybasket)

    local_airlinesdata = fetch(conn,querybasket(i));

    local_minArrDelay = min(local_airlinesdata.ArrDelay);

    minArrDelay = [minArrDelay; local_minArrDelay];

end

Find the minimum arrival delay from all the stored delays.

minArrDelay = min(minArrDelay)
minArrDelay =

   -64

Close the database connection.

close(conn)

Determine the minimum arrival delay using a large set of flight data stored in a database. Access the database using a parallel pool.

To initialize a parallel pool with a JDBC database connection, you must configure a JDBC data source. For more information, see the configureJDBCDataSource function.

Using the splitsqlquery function, you can split the original SQL query into multiple SQL page queries. Then, you can access large data in chunks by executing each SQL page query on a separate worker in the pool.

When you import large data, the performance depends on the SQL query, amount of data, machine specifications, and type of data analysis. To manage the performance, use the splitsize input argument of the splitsqlquery function.

If you have a MATLAB® Parallel Server™ license, then use the parpool (Parallel Computing Toolbox) function with the cluster profile of your choice instead of the gcp (Parallel Computing Toolbox) function.

Create a database connection to the JDBC data source MSSQLServerJDBCAuth. This data source configures a JDBC driver to a Microsoft® SQL Server® database with Windows® authentication. Specify a blank user name and password.

datasource = "MSSQLServerJDBCAuth";
username = "";
password = "";
conn = database(datasource,username,password);

Define an SQL query to select all columns from the airlinesmall table, which contains 123,523 rows and 29 columns.

sqlquery = 'SELECT * FROM airlinesmall';

Split the original SQL query into multiple page queries and display them. Specify a split size of 10,000 rows.

splitsize = 10000;
querybasket = splitsqlquery(conn,sqlquery,'SplitSize',splitsize)
querybasket = 

  13×1 string array

    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 0 ROWS FETCH NEXT 10000 ROWS ONLY"
    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 10000 ROWS FETCH NEXT 10000 ROWS ONLY"
    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 20000 ROWS FETCH NEXT 10000 ROWS ONLY"
    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 30000 ROWS FETCH NEXT 10000 ROWS ONLY"
    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 40000 ROWS FETCH NEXT 10000 ROWS ONLY"
    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 50000 ROWS FETCH NEXT 10000 ROWS ONLY"
    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 60000 ROWS FETCH NEXT 10000 ROWS ONLY"
    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 70000 ROWS FETCH NEXT 10000 ROWS ONLY"
    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 80000 ROWS FETCH NEXT 10000 ROWS ONLY"
    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 90000 ROWS FETCH NEXT 10000 ROWS ONLY"
    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 100000 ROWS FETCH NEXT 10000 ROWS ONLY"
    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 110000 ROWS FETCH NEXT 10000 ROWS ONLY"
    " SELECT * FROM (SELECT * FROM airlinesmall) temp ORDER BY 1 OFFSET 120000 ROWS FETCH NEXT 3523 ROWS ONLY"

The query basket contains the page queries in a string array. Each SQL query in the basket, except the last one, returns 10,000 rows.

Close the database connection.

close(conn)

Start the parallel pool.

pool = gcp;
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 6).

Initialize the parallel pool using the JDBC data source.

c = createConnectionForPool(pool,datasource,username,password);

Define the airlinesdata variable.

airlinesdata = [];

Define the minimum arrival delay minArrDelay variable.

minArrDelay = [];

Use the parfor function to parallelize data access using the query basket.

For each worker:

  • Retrieve the database connection object.

  • Execute the SQL page query from the query basket and import data locally.

  • Find the local minimum arrival delay.

  • Store the local minimum arrival delay.

parfor i = 1: length(querybasket)

    conn = c.Value;

    local_airlinesdata = fetch(conn,querybasket(i));

    local_minArrDelay = min(local_airlinesdata.ArrDelay);

    minArrDelay = [minArrDelay; local_minArrDelay];

end

Find the minimum arrival delay using the stored delays from each worker.

minArrDelay = min(minArrDelay)
minArrDelay =

   -64

Close the parallel pool.

delete(pool)

Input Arguments

collapse all

Database connection, specified as a connection object created with the database function, connection object created with the mysql function, connection object created with the postgresql function, or sqlite object.

Create a parallelizable databaseDatastore object by first creating a parallel pool constant. You can use the getSecret function to retrieve your user credentials when you create this constant.

Example: conn = parallel.pool.Constant(@()postgresql(getSecret("PostgreSQL.username"),getSecret("Postgresql.password"),"Server","localhost","DatabaseName","toy_store"),@close);

SQL statement, specified as a character vector or string scalar.

For information about the SQL query language, see the SQL Tutorial.

Example: SELECT * FROM invoice selects all columns and rows from the invoice table.

Data Types: char | string

SQL query split size, specified as a numeric scalar. Specify this number to split an SQL query into a custom number of rows for each batch.

If the total number of rows returned from the original SQL query is less than 100,000 (the default), then the splitsqlquery function returns the original SQL query. Use this input argument to specify a smaller number of rows in a batch.

Data Types: double

Output Arguments

collapse all

SQL query basket, returned as a string array. Each SQL query in the basket is returned as a string scalar in the string array.

You can execute each SQL query in the basket using the fetch function. Or, you can run a parallel pool and assign each SQL query to a worker for execution.

Limitations

  • The splitsqlquery function supports these databases only:

    • Microsoft® SQL Server® 2012 and later

    • Oracle®

    • MySQL®

    • PostgreSQL

    • SQLite

    • Amazon Redshift®

    • Amazon Aurora®

    • Google® Cloud SQL that runs an instance of MySQL or PostgreSQL

    • MariaDB®

    If the connection object uses an unsupported database, the splitsqlquery function displays a warning and returns the original SQL query.

  • The splitsqlquery function does not support the MATLAB® interface to SQLite.

Version History

Introduced in R2017b

See Also

| | | | (Parallel Computing Toolbox) | (Parallel Computing Toolbox) | (Parallel Computing Toolbox) | (Parallel Computing Toolbox)

External Websites