Loading the Data in Partitioned Tables

Loading the Data in Partitioned Tables
Prev	Chapter 1. MobilityDB Tutorial	Next

PostgreSQL provides partitioning mechanisms so that large tables can be split in smaller physical tables. This may result in increased performance when querying and manipulating large tables. We will split the Trips table given in the previous section using list partitioning, where each partitition will contain all the trips that start at a particular date. For doing this, we use the procedure given next for automatically creating the partitions according to a date range.

CREATE OR REPLACE FUNCTION create_partitions_by_date(TableName TEXT, StartDate DATE,
  EndDate DATE)
RETURNS void AS $$
DECLARE
  d DATE;
  PartitionName TEXT;
BEGIN
  IF NOT EXISTS (
    SELECT 1
    FROM information_schema.tables 
    WHERE table_name = lower(TableName)) 
  THEN
    RAISE EXCEPTION 'Table % does not exist', TableName;
  END IF;
  IF StartDate >= EndDate THEN
    RAISE EXCEPTION 'The start date % must be before the end date %', StartDate, EndDate;
  END IF;
  d = StartDate;
  WHILE d <= EndDate 
  LOOP
    PartitionName = TableName || '_' || to_char(d, 'YYYY_MM_DD');
    IF NOT EXISTS (
      SELECT 1
      FROM information_schema.tables 
      WHERE  table_name = lower(PartitionName))
    THEN
      EXECUTE format('CREATE TABLE %s PARTITION OF %s FOR VALUES IN (''%s'');', 
        PartitionName, TableName, to_char(d, 'YYYY-MM-DD'));
      RAISE NOTICE 'Partition % has been created', PartitionName;
    END IF;
    d = d + '1 day'::interval;
  END LOOP;
  RETURN;
END
$$ LANGUAGE plpgsql;

In order to partition table Trips by date we need to add an addition column TripDate to table TripsInput.

ALTER TABLE TripsInput ADD COLUMN TripDate DATE;
UPDATE TripsInput t1 
SET TripDate = t2.TripDate
FROM (SELECT DISTINCT TripId, date_trunc('day', MIN(t) OVER (PARTITION BY TripId))
AS TripDate FROM TripsInput) t2
WHERE t1.TripId = t2.TripId;

Notice that the UPDATE statement above takes into account the fact that a trip may finish at a day later than the starting day.

The following statements create table Trips partitioned by date and the associated partitions.

DROP TABLE Trips CASCADE;
CREATE TABLE Trips (
  TripId integer,
  TripDate date,
  VehicleId integer NOT NULL REFERENCES Vehicles(VehicleId),
  Trip tgeompoint NOT NULL,
  Trajectory geometry,
  PRIMARY KEY (TripId, TripDate)
) PARTITION BY LIST(TripDate);

SELECT create_partitions_by_date('Trips', (SELECT MIN(TripDate) FROM TripsInput), 
(SELECT MAX(TripDate) FROM TripsInput));

To see the partitions that have been created automatically we can use the following statement.

SELECT I.inhrelid::regclass AS child
FROM pg_inherits I
WHERE i.inhparent = 'trips'::regclass;

In our case this would result in the following output.

trips_2020_06_01
trips_2020_06_02
trips_2020_06_03
trips_2020_06_04
trips_2020_06_05

We modify the query that loads table Trips from the data in table TripsInput as follows.

INSERT INTO Trips
SELECT TripId, TripDate, VehicleId, tgeompoint_seq(array_agg(tgeompoint_inst(
  ST_Transform(ST_SetSRID(ST_MakePoint(PosX,PosY), 4326), 5676), t) ORDER BY t))
FROM TripsInput
GROUP BY TripId, TripDate, VehicleId;

We can see how many trips are in each partition of the TripsInput as follows.

SELECT COUNT(*) FROM trips_2020_06_01;
-- 423
SELECT COUNT(*) FROM trips_2020_06_02;
-- 411
SELECT COUNT(*) FROM trips_2020_06_03;
-- 415
SELECT COUNT(*) FROM trips_2020_06_04;
-- 419
SELECT COUNT(*) FROM trips_2020_06_05;
-- 4

Then, we can define the indexes and the views on the table Trips as shown in the previous section.

An important advantange of the partitioning mechanism in PostgreSQL is that the constraints and the indexes defined on the Trips table are propagated to the partitions as shown next.

INSERT INTO Trips VALUES (1, '2020-06-01', 10,
  '[POINT(2389629.8979609837 5626986.483650829)@2020-06-02 08:00]');
-- ERROR:  duplicate key value violates unique constraint "trips_2020_06_01_pkey"
-- DETAIL:  Key (tripid, tripdate)=(1, 2020-06-01) already exists.

Similarly, queries on the Trips table are propagated to the partitions as shown next.

EXPLAIN SELECT COUNT(*) FROM Trips WHERE Trip && tstzspan '[2020-06-02, 2020-06-03)';

If there is no index defined on the Trip column, the execution plan of the query is as follows:

Aggregate (cost=63.64..63.65 rows=1 width=8)
-> Append (cost=0.00..63.62 rows=5 width=0)
   -> Seq Scan on trips_2020_06_01 trips_1  (cost=0.00..11.29 rows=1 width=0)
        Filter: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::tstzspan)
   -> Seq Scan on trips_2020_06_02 trips_2  (cost=0.00..11.14 rows=1 width=0)
        Filter: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::tstzspan)
   -> Seq Scan on trips_2020_06_03 trips_3  (cost=0.00..11.19 rows=1 width=0)
        Filter: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::tstzspan)
   -> Seq Scan on trips_2020_06_04 trips_4  (cost=0.00..10.24 rows=1 width=0)
        Filter: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::tstzspan)
   -> Seq Scan on trips_2020_06_05 trips_5  (cost=0.00..19.75 rows=1 width=0)
        Filter: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::tstzspan)

After defining an index on the Trip column as follows

CREATE INDEX Trips_Trip_gist_Idx ON Trips USING gist (Trip);

the execution plan of the query is as follows

Aggregate (cost=33.73..33.74 rows=1 width=8)
-> Append (cost=0.14..33.71 rows=5 width=0)
   -> Index Scan using trips_2020_06_01_trip_idx on trips_2020_06_01 trips_1
      (cost=0.14..8.16 rows=1 width=0)
        Index Cond: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::tstzspan)
   -> Index Scan using trips_2020_06_02_trip_idx on trips_2020_06_02 trips_2
      (cost=0.14..8.16 rows=1 width=0)
        Index Cond: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::tstzspan)
   -> Index Scan using trips_2020_06_03_trip_idx on trips_2020_06_03 trips_3
      (cost=0.14..8.16 rows=1 width=0)
        Index Cond: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::tstzspan)
   -> Index Scan using trips_2020_06_04_trip_idx on trips_2020_06_04 trips_4
      (cost=0.14..8.16 rows=1 width=0)
        Index Cond: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::tstzspan)
   -> Seq Scan on trips_2020_06_05 trips_5  (cost=0.00..1.05 rows=1 width=0)
        Filter: (trip && '[2020-06-02 00:00:00+02, 2020-06-03 00:00:00+02)'::tstzspan)