Applies median filters to detect large jumps (i.e., outliers) in the trajectories.
Source: R/avl_cleaning.R
Noise in GPS trajectories can manifest itself as one or more points lying far
away from points recorded at a similar time. This function identifies these
points using median filters. By default, outliers are removed. See Details
for a discussion of removal methodologies.
Usage
clean_jumps(
distance_df,
neighborhood_width = 7,
t_cutoff = 3,
min_median_deviation = -Inf,
max_median_deviation = Inf,
evaluate_tails = FALSE,
evaluate_implosions = FALSE,
replace_outliers = FALSE,
return_removals = FALSE
)
Arguments
- distance_df
A dataframe of linearized AVL data. Must include trip_id_performed, event_timestamp, and distance.
- neighborhood_width
Optional. An integer representing the total sliding window width around each observation. Default is 7 (3 on either side).
- t_cutoff
Optional. For Hampel filters, the number of standardized MADs from the window median beyond which a point is considered an outlier. Default is 3.
- min_median_deviation
Optional. A numeric, the minimum allowed deviation of an observation from its window median, in units of distance. Default is -Inf.
- max_median_deviation
Optional. A numeric, the maximum allowed deviation of an observation from its window median, in units of distance. Default is Inf.
- evaluate_tails
Optional. A boolean, should the beginning and ending observations, before a complete window can be formed, be evaluated? Default is FALSE.
- evaluate_implosions
Optional. A boolean, should points in implosion sequences be evaluated? "Implosions" occur when more than half of a window is constant. Default is FALSE.
- replace_outliers
Optional. A boolean, should points identified as outliers be replaced by their window median? Default is FALSE.
- return_removals
Optional. A boolean, should the function return a dataframe of points removed and why? Default is FALSE.
Value
The input distance_df with violating points removed. If
return_removals = TRUE, a dataframe of the observations removed and the reason for each removal.
Details
There are many different types of median filters. In general, these filters
create a sliding window around each point (here, controlled by
neighborhood_width) and treat a point based on its deviation from the
median of that window. This function supports two main ways of classifying
outliers based on their deviation:
- Raw deviation: min_median_deviation and max_median_deviation set bounds for an acceptable deviation between a point and the median of the window around it, in units of the input distance column.
- Hampel filter: Uses the median absolute deviation (MAD), the median of absolute deviations from the window median. With a conversion factor (s = 1.48), this is analogous to a standard error. The t_cutoff, then, is analogous to an acceptable window of t values.
Both of these can be used at the same time. If multiple criteria are set, a point will be removed if it violates any criterion.
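As a rough illustration of how the two criteria combine, the sketch below flags points in a plain numeric vector using base R only. It is not the package's internal code: the helper name flag_jumps, its argument names, the toy distance vector, and the choice to treat the deviation as signed (observation minus window median) are all assumptions made for this example. It skips the tails and does no special handling of implosions.

```r
# Hypothetical helper for illustration only; not part of the package.
flag_jumps <- function(x, width = 7, t_cutoff = 3,
                       min_dev = -Inf, max_dev = Inf) {
  half <- (width - 1) %/% 2          # points on either side (width assumed odd)
  n <- length(x)
  flagged <- rep(FALSE, n)
  if (n < width) return(flagged)     # no complete window can be formed

  # Only evaluate points that have a complete window; tails are skipped.
  for (i in seq(half + 1, n - half)) {
    window <- x[(i - half):(i + half)]
    med <- median(window)
    dev <- x[i] - med                # signed deviation (an assumption)

    # stats::mad() scales the raw MAD by 1.4826 by default.
    scaled_mad <- mad(window, center = med)

    hampel <- abs(dev) > t_cutoff * scaled_mad
    raw <- dev < min_dev || dev > max_dev

    # A point is flagged if it violates any criterion.
    flagged[i] <- hampel || raw
  }
  flagged
}

# Toy cumulative distances with one large jump at position 6.
distance <- c(0, 10, 21, 29, 40, 500, 61, 70, 79, 90, 101)
which(flag_jumps(distance))
#> [1] 6
```

Tightening min_dev or max_dev flags additional points even when the Hampel criterion alone would keep them, which is the "any criterion" behavior described above.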
A Hampel filter is generally considered highly robust, and is the recommended approach. There are, however, two main limitations to be aware of:
- It can struggle at the beginnings and ends of trips, before a complete window can be formed. Use evaluate_tails = FALSE (the default) to skip these.
- When more than half of a window has the same value, the MAD is 0 and any observation that differs from the window median is guaranteed to be flagged as an outlier. This is known as an "implosion". As we expect noise in GPS data, even when a vehicle is standing still, this is unlikely. It can occur, however, near trip terminals, where many GPS points snap to the exact same point on the route alignment. Use evaluate_implosions = FALSE (the default) to identify and skip points in an implosion.
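The implosion case can be reproduced with a few lines of base R. The window values below are made up for illustration, and the is_implosion check is only one plausible way to detect the condition, not necessarily how the package does it.

```r
# A 7-point window in which 4 of 7 values are identical, as might happen
# when GPS points snap to the same location near a terminal.
window <- c(100, 100, 100, 100, 100.4, 100.9, 101.5)

med <- median(window)        # 100
scaled_mad <- mad(window)    # 0, since more than half the deviations are 0

# With a MAD of 0, every point that differs from the median exceeds the
# Hampel threshold, however small its deviation.
flagged <- abs(window - med) > 3 * scaled_mad
flagged
#> [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

# One possible (assumed) implosion check: more than half the window
# equals the window median.
is_implosion <- mean(window == med) > 0.5
is_implosion
#> [1] TRUE
```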
Once a point has been identified as an outlier, there are two possible
treatments, controlled by replace_outliers:
- Replacement with the window median (replace_outliers = TRUE). This is the most common approach to median filters, but is likely not appropriate for AVL data. Replacement may introduce non-monotonicities.
- Removal of the point (replace_outliers = FALSE, the default). This is a less common approach, but may be more sensible for this application, given that interpolating curves will be fit later in the cleaning process.
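To see why replacement can break monotonicity, consider the toy example below. The distances, the flagged positions, and the window width of 5 are invented for illustration, and the medians are assumed to be taken from the original (unreplaced) values; the package's own behavior may differ in these details.

```r
# Toy cumulative distances; positions 5 and 6 are taken as already flagged.
distance <- c(0, 10, 20, 30, 5, 8, 40, 50, 60)
flagged <- c(5, 6)
half <- 2  # window width of 5 (2 points on either side)

# Median of the window centered on position i, using the original values.
window_median <- function(x, i, half) median(x[(i - half):(i + half)])

# Treatment 1: replace each flagged point with its window median.
replaced <- distance
replaced[flagged] <- vapply(
  flagged,
  function(i) window_median(distance, i, half),
  numeric(1)
)
replaced
#> [1]  0 10 20 30 20 30 40 50 60

# Treatment 2: remove the flagged points.
removed <- distance[-flagged]

any(diff(replaced) < 0)  # distance still decreases after replacement
#> [1] TRUE
any(diff(removed) < 0)   # removal leaves a monotone sequence
#> [1] FALSE
```

Here the two adjacent outliers pull the window medians below the preceding valid observation, so the replaced series still steps backward, while removal leaves a gap that a later interpolating curve can bridge.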
Examples
# Set my parameters
my_cutoff <- 2.5
my_neighborhood <- 9
# Get input data
c53_no_overlaps <- new_transittraj_data("clean_overlapping_subtrips")
dim(c53_no_overlaps)
#> [1] 639 11
# Run function
c53_no_jumps <- clean_jumps(distance_df = c53_no_overlaps,
                            neighborhood_width = my_neighborhood,
                            t_cutoff = my_cutoff)
dim(c53_no_jumps)
#> [1] 637 11
head(c53_no_jumps)
#> # A tibble: 6 × 11
#> location_ping_id vehicle_id trip_id_performed service_date route_id
#> <chr> <chr> <chr> <date> <chr>
#> 1 12620 2836 1306100 2026-02-16 C53
#> 2 12647 2836 1306100 2026-02-16 C53
#> 3 12728 2836 1306100 2026-02-16 C53
#> 4 12809 2836 1306100 2026-02-16 C53
#> 5 12890 2836 1306100 2026-02-16 C53
#> 6 12971 2836 1306100 2026-02-16 C53
#> # ℹ 6 more variables: direction_id <dbl>, speed <dbl>,
#> # trip_stop_sequence <dbl>, event_timestamp <dttm>, stop_id <int>,
#> # distance <dbl>