**Logical operators:** We often want to examine observations with certain
characteristics (e.g. flights from a particular airport) separately from the
rest of the data. To do so, we use the `filter` function together with a series
of **logical operators**. The most commonly used logical operators for data
analysis are as follows:

- `==` means "equal to"
- `!=` means "not equal to"
- `>` or `<` means "greater than" or "less than"
- `>=` or `<=` means "greater than or equal to" or "less than or equal to"

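As a minimal sketch of how these operators behave inside `filter`, the chunk below uses a small made-up data frame, `toy_flights`. Its column names mirror those in `nycflights`, but the data frame and its values are hypothetical and exist only for illustration.

```{r}
library(dplyr)

# Hypothetical toy data: column names mirror nycflights, values are made up
toy_flights <- data.frame(
  dest      = c("LAX", "SFO", "LAX", "BOS"),
  month     = c(1, 2, 2, 7),
  dep_delay = c(-3, 12, 45, 0)
)

# `==` keeps only flights headed to LAX
toy_lax <- toy_flights %>%
  filter(dest == "LAX")

# `!=` keeps only flights NOT headed to LAX
toy_not_lax <- toy_flights %>%
  filter(dest != "LAX")

# `>=` keeps only flights delayed by 10 minutes or more
toy_late <- toy_flights %>%
  filter(dep_delay >= 10)
```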
We can also obtain numerical summaries for these flights:
```{r lax-flights-summ}
lax_flights %>%
  summarise(mean_dd = mean(dep_delay), median_dd = median(dep_delay), n = n())
```
Note that in the `summarise` function we created a list of three different
numerical summaries that we were interested in. The names of these elements are
user defined, like `mean_dd`, `median_dd`, and `n`, and you can customize these
names as you like (just don't use spaces in your names). Calculating these
summary statistics also requires that you know the function calls. Note that
`n()` reports the sample size.

**Summary statistics:** Some useful function calls for summary statistics for a
single numerical variable are as follows:

- `mean`
- `median`
- `sd`
- `var`
- `IQR`
- `min`
- `max`

Note that each of these functions takes a single vector as an argument and
returns a single value.

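To make the "one vector in, one value out" behavior concrete, here is a small sketch on a made-up vector of delays. The name `toy_delays` and its values are hypothetical. The last two lines illustrate a general property of `mean` in base R: a missing value makes the result `NA` unless we pass `na.rm = TRUE`.

```{r}
# Hypothetical departure delays in minutes (made-up values)
toy_delays <- c(-2, 0, 3, 5, 60)

# Each call takes the whole vector and returns a single number
toy_mean   <- mean(toy_delays)
toy_median <- median(toy_delays)
toy_sd     <- sd(toy_delays)
toy_iqr    <- IQR(toy_delays)

# With a missing value, mean() returns NA unless na.rm = TRUE is set
toy_mean_na    <- mean(c(toy_delays, NA))
toy_mean_clean <- mean(c(toy_delays, NA), na.rm = TRUE)
```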
We can also filter based on multiple criteria. Suppose we are interested in
flights headed to San Francisco (SFO) in February:
```{r}
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
```
Note that we can separate the conditions using commas if we want flights that
are both headed to SFO **and** in February. If we are interested in either
flights headed to SFO **or** in February we can use the `|` instead of the comma.
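
The difference between the comma and `|` is easiest to see on a few rows. The sketch below uses a hypothetical data frame, `toy_flights`, with made-up values; only the column names are borrowed from `nycflights`.

```{r}
library(dplyr)

# Hypothetical toy data: one row for each combination of interest
toy_flights <- data.frame(
  dest  = c("SFO", "SFO", "LAX", "BOS"),
  month = c(2, 7, 2, 7)
)

# Comma: a row is kept only if BOTH conditions are TRUE
toy_and <- toy_flights %>%
  filter(dest == "SFO", month == 2)

# `|`: a row is kept if AT LEAST ONE condition is TRUE
toy_or <- toy_flights %>%
  filter(dest == "SFO" | month == 2)
```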

1. Create a new data frame that includes flights headed to SFO in February,
   and save this data frame as `sfo_feb_flights`. How many flights
   meet these criteria?

1. Describe the distribution of the **arrival** delays of these flights using a
   histogram and appropriate summary statistics. **Hint:** The summary
   statistics you use should depend on the shape of the distribution.

Another useful technique is quickly calculating summary
statistics for various groups in your data frame. For example, we can modify the
above command using the `group_by` function to get the same summary stats for
each origin airport:
```{r summary-custom-list-origin}
sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())
```
Here, we first grouped the data by `origin`, and then calculated the summary
statistics.

1. Calculate the median and interquartile range for `arr_delay`s of flights in
   the `sfo_feb_flights` data frame, grouped by carrier. Which carrier
   has the most variable arrival delays?

### Departure delays over months

Which month would you expect to have the highest average delay departing from an
NYC airport?

Let's think about how we would answer this question:

- First, calculate monthly averages for departure delays. With the new language
  we are learning, we need to
    + `group_by` months, then
    + `summarise` mean departure delays.
- Then, we need to `arrange` these average delays in `desc`ending order.

```{r mean-dep-delay-months}
nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(desc(mean_dd))
```
1. Suppose you really dislike departure delays, and you want to schedule
   your travel in a month that minimizes your potential departure delay leaving
   NYC. One option is to choose the month with the lowest mean departure delay.
   Another option is to choose the month with the lowest median departure delay.
   What are the pros and cons of these two choices?

### On time departure rate for NYC airports

Suppose you will be flying out of NYC and want to know which of the
three major NYC airports has the best on time departure rate.
Suppose also that for you a flight that is delayed for less than 5 minutes is
basically "on time". You consider any flight delayed for 5 minutes or more to be
"delayed".

In order to determine which airport has the best on time departure rate,
we need to

- first classify each flight as "on time" or "delayed",
- then group flights by origin airport,
- then calculate on time departure rates for each origin airport,
- and finally arrange the airports in descending order for on time departure
  percentage.

Let's start with classifying each flight as "on time" or "delayed" by
creating a new variable with the `mutate` function.
```{r dep-type}
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
```
The first argument in the `mutate` function is the name of the new variable
we want to create, in this case `dep_type`. Its values come from `ifelse`: if
`dep_delay < 5`, we classify the flight as `"on time"`, and otherwise, i.e. if
the flight is delayed for 5 or more minutes, as `"delayed"`.

Note that we are also overwriting the `nycflights` data frame with the new
version of this data frame that includes the new `dep_type` variable.
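
To see how `ifelse` assigns the labels, including right at the 5 minute cutoff, here is a minimal sketch on a made-up vector of delays. The names `toy_delays` and `toy_dep_type` and the values are hypothetical.

```{r}
# Hypothetical departure delays in minutes: early, just under, at, and over the cutoff
toy_delays <- c(-3, 4, 5, 30)

# ifelse() checks the condition element by element and returns one label per delay;
# a delay of exactly 5 fails `< 5`, so it is labelled "delayed"
toy_dep_type <- ifelse(toy_delays < 5, "on time", "delayed")
```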
We can handle all the remaining steps in one code chunk:
```{r}
nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))
```
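The rate above is a count of `TRUE` values divided by the group size, which is the same as taking the `mean` of the logical condition. The sketch below checks this on a hypothetical data frame, `toy_flights`, with made-up values; the column names `rate_sum` and `rate_mean` are ours.

```{r}
library(dplyr)

# Hypothetical toy data: six made-up flights from the three NYC airports
toy_flights <- data.frame(
  origin   = c("JFK", "JFK", "LGA", "LGA", "LGA", "EWR"),
  dep_type = c("on time", "delayed", "on time", "on time", "delayed", "delayed")
)

toy_rates <- toy_flights %>%
  group_by(origin) %>%
  summarise(
    rate_sum  = sum(dep_type == "on time") / n(),  # count of TRUEs over group size
    rate_mean = mean(dep_type == "on time")        # TRUE counts as 1, FALSE as 0
  ) %>%
  arrange(desc(rate_sum))
```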
1. If you were selecting an airport simply based on on time departure
   percentage, which NYC airport would you choose to fly out of?

We can also visualize the distribution of on time departure rates across
the three airports using a segmented bar plot.
```{r}
qplot(x = origin, fill = dep_type, data = nycflights, geom = "bar")
```
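Recent versions of ggplot2 (3.4.0 and later, as far as we know) mark `qplot` as deprecated, so here is a sketch of how the same kind of plot could be built with `ggplot` and `geom_bar`. It uses a hypothetical data frame, `toy_flights`, with made-up values. The second plot, with `position = "fill"`, is our own variation that rescales each bar to proportions, which can make rates easier to compare across airports.

```{r}
library(ggplot2)

# Hypothetical toy data: made-up flights, column names mirror nycflights
toy_flights <- data.frame(
  origin   = c("JFK", "JFK", "LGA", "LGA", "LGA", "EWR"),
  dep_type = c("on time", "delayed", "on time", "on time", "delayed", "delayed")
)

# Segmented bars of counts (bars are stacked by default)
p_count <- ggplot(toy_flights, aes(x = origin, fill = dep_type)) +
  geom_bar()

# Same bars rescaled so that each airport's segments sum to 1
p_prop <- ggplot(toy_flights, aes(x = origin, fill = dep_type)) +
  geom_bar(position = "fill") +
  labs(y = "proportion of flights")

p_count
p_prop
```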
* * *

## More Practice

1. Mutate the data frame so that it includes a new variable that contains the
   average speed, `avg_speed`, traveled by the plane for each flight (in mph).
   **Hint:** Average speed can be calculated as distance divided by
   number of hours of travel, and note that `air_time` is given in minutes.

1. Make a scatterplot of `avg_speed` vs. `distance`. Describe the relationship
   between average speed and distance.
   **Hint:** Use `geom = "point"`.

1. Replicate the following plot. **Hint:** The data frame plotted only
   contains flights from American Airlines, Delta Airlines, and United
   Airlines, and the points are `color`ed by `carrier`. Once you replicate
   the plot, determine (roughly) what the cutoff point is for departure
   delays where you can still expect to get to your destination on time.

```{r echo=FALSE, eval=TRUE, fig.width=7, fig.height=4}
dl_aa_ua <- nycflights %>%
  filter(carrier == "AA" | carrier == "DL" | carrier == "UA")
qplot(x = dep_delay, y = arr_delay, data = dl_aa_ua, color = carrier)
```