--- title: "Analyzing plant disease epidemics with the R package epiphy" author: "Christophe Gigot" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{Analyzing plant disease epidemics with the R package epiphy} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console bibliography: biblio.bib csl: phytopathology_v2.csl --- ```{r setup, include=FALSE} knitr::opts_chunk$set(fig.width = 7) # To avoid startup messages when loading epiphy hereinafter: suppressPackageStartupMessages(library(epiphy)) vers <- packageVersion("epiphy") ``` ## Abstract This paper presents the R package **epiphy** which provides a common framework for spatialized plant disease intensity data collected at one or more time points. Many statistical methods developed over the last decades to describe and quantify plant disease epidemics are implemented. The paper is an introduction to the framework concepts and the provided tools using different sample codes to illustrate possible workflows. **Keywords:** plant disease epidemics, phytopathology, spatial data, spatial aggregation, aggregation index, beta-binomial distribution, Taylor's power law, binary power law, spatial hierarchy analysis, SADIE, R. ## Introduction Performing analyses of plant disease intensity data is a common task for many field phytopathologists. Different softwares have been developped to make computationally available some statistical methods, but there are separated programs and are sometimes restricted to a specific computer system. This paper describes the version `r paste(vers$major, vers$minor, vers$patch, sep = ".")` of the package **epiphy** for R [@R_core_team_2015], which offers an uniform and coherent framework to perform spatial analyses in plant disease epidemics. Efforts were made to ensure that the users can fluently and fluidly carry out the data analyses that meet their needs, making it possible to generalize and automate such tasks, and piece together a sequence of operations, while limiting the need for reimplementing methods described in the scientific literature to save time and reduce the potential for error. Another key advantage of this package is to allow users unfamiliar with these methods (e.g. students, other scientists) to use them with safeguards against misuses of methods specific for a given kind of data set for instance. While firstly intented to be a toolbox for phytopathologists, implemented methods may be easily translatable to other data contexts. This package consists of three components: a bundle of historical data sets in plant disease epidemiology, a set of relevant data classes to reliably and efficiently handle date sets, and statistical methods developped over the last few decades to extract information from collected data sets. Information about these different components are given in this paper, but the main amphasis is made on the practical tools using code examples. As the **epiphy** package is not yet available on CRAN, one needs to use the following lines to install it: ```{r load_pkg, message=FALSE, warning=FALSE, eval=FALSE} install.packages("devtools") # If not already installed. devtools::install_github("chgigot/epiphy") # Note: Same command for the updates. library(epiphy) ``` ## Data sets The package **epiphy** is provided with a bundle of more than 10 historical data sets that were mainly published in plant disease epidemiology literature. There is, for example, counts of arthropods made in a wheat field in UK in 1996 [@Holland_etal_1999] or incidence of tomato spotted wilt virus (TSWV) disease recorded in field trials at the Waite Institute (Australia) in 1929 [@Cochran_1936]. These two data sets (called `arthropods` and `tomato_tswv$field_1929` in **epiphy**) will be used throughout this paper. To take a look at all the available data sets, type `data(package = "epiphy")`. Each data set is supported by relevant documentation specifying briefly the context of data collection, the data structure and the published references. Note that you do not need to use the function `data` to load a provided data set because all of them are already lazily loaded when **epiphy** is loaded. This means that they will not occupy any memory until you use them. ```{r load_data_sets} str(arthropods) str(tomato_tswv$field_1929) ``` In these two examples, $x$ and $y$ correspond to the spatial coordinates of the sampling units distributed in a regular two-dimensional grid. Specifically $x$ and $y$ are the row id and the within-row sampling unit id, respectively. $t$ variable stands for the assessment time or date. There are six and three recording dates for the counts of arthropods and the TSWV incidence data sets, respectively. $i$ variable corresponds to the number of recorded individuals (arhtropods or diseased plants) in each sampling unit. $n$, which is only there in incidence data sets, is the total number of individuals in a sampling unit. As in the raw TSWV incidence data set, $n = 1$ everywhere, this means that each sampling unit contains only one plant and so $i$ can only be equal to 0 (the plant is healthy) or 1 (the plant is diseased). ## Conceptual framework In order to collect spatial plant disease data, an appropriate sampling unit (or quadrat or sample unit or cluster) corresponding to a location where assessments are carried out must be chosen. It may be a plant unit (typically a leaf in the case of foliar pathogens), an indivudal plant or a cluster of nearby plants. The different ways of recording disease levels (or disease intensity) were sometimes confusing in the literature. We strive to stick to the nomenclature proposed by @McRoberts_etal_2003 where the count of visible symptoms (such as lesions), the presence or absence of disease, and the assessment of the proportion of plant tissue diseased correspond to count, incidence and severity data, respectively. To reliably handle such kinds of observational data sets, the package **epiphy** relies on the definition of a set of relevant classes. There is a mother class, named `intensity`, which makes it possible to format disease data sets for further analyses and check that everything is fine with the provided data (e.g. the input data must be a data frame). Object creation is only possible for one of the three subclasses of `intensity` which are `count`, `incidence` and `severity` for eponymous kinds of data. (Note that there is no implemented methods for the `severity` class at the moment.) Count data are integers starting from zero with no upper limit, while incidence data differ only by a upper limit set equal to the number of individuals in the sampling unit. Severity data correspond to percentages ranging from 0 to 100%. When creating an object of one of the subclasses of `intensity`, it is necessary to perform variable mapping to describe how variables in the input data frame will be mapped to spatial, temporal and observational properties of the analysis methods described in a later section. The reserved variable names for spatial information correspond to the three spatial dimensions, `x`, `y` and `z`. (Note that no currently implemented method deals with the third dimension `z`). `t` is used to map temporal information. Finally, `i` and `n` are reserved for the so-called observational properties. They stand for recorded intensity and number of individuals in a sampling unit, respectively. Variable mapping can be implicit, if (some of) the reserved names are already present in the column names of the input data frame, or explicit, if the user make the links between the reserved names and the column names using the function `mapping`. Note that this paradigm is similar to the one used in **ggplot2** package with the function `aes`. ```{r create_intensity} # Count data # We will use only the last assessment date for the arthropods data set: arthropods_t6 <- arthropods[arthropods$t == 3, ] # - Explicit mapping: (cou_t3 <- count(arthropods_t6, mapping(x = x, y = y, t = t, i = i))) # - Total implicit mapping: cou_t3_bis <- count(arthropods_t6) # - Partial implicit mapping: cou_t3_ter <- count(arthropods_t6, mapping(i = i)) all(identical(cou_t3, cou_t3_bis), identical(cou_t3, cou_t3_ter)) # Implicit mapping for incidence data: (inc <- incidence(tomato_tswv$field_1929)) ``` Some useful information are displayed when you print an `intensity` object, such as the exact nature of this object (`count`, `incidence` or `severity`). Mapping variables (with squared brackets) and mapped variables (just below mapping variables) are also printed. You can also plot such objects to vizualize all the data in a convenient way. ```{r plot_count} plot(cou_t3, tile = FALSE, size = 5) ``` **Figure 1.** Observation sequences of maps of of the counts of arthropods over time. It is possible to perform useful data transformation directly with `intensity` objects. For example, the `clump` function can be used to regroup adjacent sampling units into bigger ones, and thus redefine what is a sampling unit in a given data set. An extended version of `split` was also implemented to deal with `intensity` objects in an efficient way. In addition, you can use `as.data.frame` anytime you want to retrieve the underlying data frame of an `intensity` object (without any mapping). ```{r utilities_intensity, fig.show = "hold"} inc9 <- clump(inc, unit_size = c(x = 3, y = 3)) plot(inc) plot(inc9) ``` **Figure 2.** Observation sequences of maps of TSWV incidence data over time for two definition of a sampling unit. A sampling unit contains either only one tomato plant (**above**), or a set of 9 plants (**below**). In each case, the three maps correspond to the same field at different dates. ```{r fig.width = 3, fig.show = "hold"} inc9_t1 <- split(inc9, by = "t")[[1]] inc9_t1_sub <- split(inc9_t1, unit_size = c(x = 4, y = 5))[[6]] plot(inc9_t1) plot(inc9_t1_sub) ``` **Figure 3.** Different sub-parts of the TSWV incidence data set, with only what was observed for the first scoring time (**left**) and for a subpart of this same scoring time (**right**). ## Statistical methods A collection of statistical methods has been implemented in **epiphy**. At the moment, the available tools include several indices of aggregation (e.g. Fisher's, Lloyd's and Morisita's indices), distribution fitting to reveal any spatial aggregation in data sets, power law analysis (Taylor's and binary forms) and an early version of Spatial Analysis by Distance IndicEs (SADIE). We strived to mention most relevant scientific literature related to each method in the corresponding R help pages. Most of the time, a function dedicated to some methods is clever enough to know which "flavor" of the method needs to be used with the provided data set. For example, if you use the function `power_law` with a `count` data set, the regular Taylor's power law will be called, whereas in the case of `incidence` data, it will be the binary form of the power law. That is the other reason why, in addition to performing initial compliance tests, a set of dedicated classes was implemented to handle different kinds of `intensity` data sets. In any case, the function outputs will let you know what flavor was used to perform the analysis. ### Aggregation indices The index of aggregation for `incidence` data is calculated by default when the `agg_index` function is used with such a data set. ```{r agg_idx} (inc9_t1_idx <- agg_index(inc9_t1)) ``` If this function is called with `count` data, the corresponding version of this index (also called Fisher's index of aggregation) is calculated. Other indices may be calculated with `agg_index`, such as Lloyd's index of patchiness and Morisita's coefficient of dispersion. The index calculated by default can be tested using a chi-squared test, a z-test or a c($\alpha$) test. ```{r agg_idx_test} chisq.test(inc9_t1_idx) z.test(inc9_t1_idx) calpha.test(inc9_t1_idx) ``` In this example, the null hypothesis of non-aggregation is rejected. ### Distribution fitting As its name implies, the function `fit_two_distr` try to fit two different distributions to a given data set. One distribution is supposed to be representative of a random pattern, while the second one should denote an aggregated pattern. For `count` data, the default distributions are Poisson and negative binomial for random and aggregated patterns, respectively. For `incidence` data, the default distributions are binomial and beta-binomial for random and aggregated patterns, respectively [@Hughes_Madden_1993; @Madden_Hughes_1995]. In the latter case, `fit_two_distr` may be viewed as an alternative to the BBD software [@Madden_Hughes_1994]. Note that **epiphy** provides also a set of handy functions to work with the beta-binomial distribution (`dbetabinom`, `pbetabinom`, `qbetabinom` and `rbetabinom`). ```{r fit_distributions, warning=FALSE, fig.width=3, fig.show = "hold"} cou_t3_distr <- fit_two_distr(cou_t3) summary(cou_t3_distr) inc9_t1_distr <- fit_two_distr(inc9_t1) summary(inc9_t1_distr) plot(cou_t3_distr, breaks = 17) plot(inc9_t1_distr) ``` **Figure 4.** Frequency distributions of the count of arthropods in a wheat field in UK on 12 July 1996 (**left**), and the incidence of TSWV disease in a tomato field in Australia on 18 December 1930 (**right**). The assessments of arthropods counts and TSWV incidence were reported by @Perry_etal_1999 and @Cochran_1936, respectively. The black bars represent observed frequencies, the grey bars represent expected aggregated frequencies (negative binomial on the **left** and beta-binomial on the **right**), and the white bars represent expected random frequencies (Poisson on the **left** and binomial on the **right**). ### Power law Taylor's power law can be used to assess the overall degree of heterogeneity in a collection of `count` data sets at the sampling-unit scale [@Taylor_1961]. A binary form of this power law was proposed to deal with `incidence` data [@Hughes_Madden_1992]. Taylor's and binary power laws describe the relationship between the observed variance of diseased individuals (or individuals of interest) within a data set and the corresponding variance under the assumption that the data have a random distribution distribution (i.e., Poisson and binomial for `count` and `incidence` data, respectively). For the sake of illustration, the count of arthropods will be splitted into data sets of 9 sampling units each (3 rows $\times$ 3 sampling units $\times$ 1 recording date) before performing Taylor's power law analysis on this data set. To also give an example use of the binary form of the power law, we will split the TSWV incidence data into data sets of 20 sampling units each (4 rows $\times$ 5 sampling units of 9 plants each $\times$ 1 recording date) in order to also simulate a collection of different data sets. ```{r power_laws, fig.width=3, fig.show = "hold"} cou <- count(arthropods[arthropods$x <= 6, ]) cou <- split(cou, unit_size = c(x = 3, y = 3)) cou_plaw <- power_law(cou) coef(summary(cou_plaw)) inc9_spl <- split(inc9, unit_size = c(x = 4, y = 5)) inc_plaw <- power_law(inc9_spl) coef(summary(inc_plaw)) plot(cou_plaw) plot(inc_plaw) ``` **Figure 5.** Relationship between the logarithm of the observed variance and the logarithm of the theoretical variance for counts of arthropods carryied out in UK (**left**) and incidence data of TSWV disease collected in Australia (**right**). Solid lines indicate the linear relationship (on logarithmic axes) between observed and theoretical random variances. Dashed lines indicate the cases where both variances are equal (which suggests an absence of aggregation). ### Spatial hierarchy To carry out spatial hierarchy analyses [@Hughes_etal_1997], it is necessary to prepare the existing data sets. To do so, the `threshold` function is of primary interest. As in graphics editors, it allows to "simplify" the image in the sense that every value below and above a given threshold is given the value 0 and 1, respectively. By default, everything above 0 is given 1, and 0 stays at 0. `threshold` is thus useful to report a whole sampling unit as "healthy" (0), if no diseased individual at all was found within the sampling unit, or "diseased" (1) if at least one diseased individual was found. ```{r threshold_function, fig.width = 3, fig.show = "hold"} plot(inc9_t1) plot(threshold(inc9_t1)) ``` **Figure 6.** Disease incidence of TSWV for sampling units consisting in 9 tomato plants, at the plant level (**left**) and the sampling unit level (**right**). These figures were made using the intensively mapped TSWV incidence data reported by @Cochran_1936 for the first assessment performed on 18 December 1929. For the sake of illustration, the TSWV incidence data reported by @Cochran_1936 will be first splitted into data sets of 20 sampling units each (4 rows $\times$ 5 sampling units of 9 plants each $\times$ 1 recording date) to simulate a collection of different `incidence` data sets. Then, disease incidence at the sampling unit level will be calculated, before performing a spatial hierarchy analysis. ```{r spatial_hierarchies} inc_low <- split(inc9, unit_size = c(x = 4, y = 5, t = 1)) inc_high <- lapply(inc_low, threshold) (inc_sphier <- spatial_hier(inc_low, inc_high)) plot(inc_sphier) ``` **Figure 7.** Relationship between the incidences of TSWV disease at the tomato plant and sampling unit level (made of 9 plants) in a two level spatial hierarchy where the sampling unit is the highest level and the plant is the lowest level. Dashed curve is the binomial fit to the data and the solid curve is the beta-binomial fit to the data. This graph is based on 24 data sets of the incidence of TSWV disease collected in 1929 in field trials in Australia. ### Spatial Analysis by Distance IndicEs (SADIE) This two-dimensional geostatistical approach uses the relative locations of the sampling units and the number of diseased individuals per sampling unit to quantify the spatial arrangement of diseased individuals by calculating the distance to regularity [@Perry_1995]. Regularity is defined as the state where each sampling unit of a given data set contains the same number of diseased individuals (i.e., the mean number of diseased individuals for this data set). The SADIE procedure uses a transportation algorithm to calculate the distance to regularity, and performs a randomization test to determine if an observed distance to regularity is particularly small or large. **epiphy** implements an early version of a cross-plateform SADIE procedure. To perform a SADIE analysis, the spatial coordinates must reflect the real relative distances between the different sampling units. If you mapped $x$ and $y$ variables to grid coordinates, you can use the `remap` function to map them to metric coordinates (if any in your data set). ```{r sadie, fig.height = 5, fig.show = "hold"} set.seed(123) cou_t3_m <- remap(cou_t3, mapping(x = xm, y = ym)) plot(cou_t3_m) res <- sadie(cou_t3_m) summary(res) plot(res) plot(res, isoclines = TRUE) ``` **Figure 8.** Maps of clustering indices with index symbols alone (**top**) or with interpolated landscape and contours (**bottom**). For the **top** map, symbols filled with blue (receivers) and red (donors) color indicate that absolute values of Perry's indices are > 1.5. ### Map comparison (MAPCOMP) The MAPCOMP procedure proposed by @Lavigne_etal_2010 relies on the calculation of the Hellinger distance between the density map of recorded intensity data and the density map of sampling effort. ```{r mapcomp, fig.height = 5} set.seed(123) res <- mapcomp(cou_t3_m, delta = 4, bandwidth = 60) res plot(res) ``` **Figure 9.** Density map. ## Conclusion The package **epiphy** implements currently many of the methods described in the chapter 9 of the book "The study of plant disease epidemics" [@Madden_etal_2007]. We hope that such statistical methods packaged in a consistent way to be easily used in an open statistical environment will facilitate their use and spread in the phytopathology community, and even beyond. ## Acknowledgments The authors are grateful to Prof. Xiangming Xu for discussion and advice regarding the SADIE procedure. ## Annexes Most of the functions in the package **epiphy** have been designed to be compatible with pipeline coding. Using the package **magrittr**, you can pipe the analyses as in the following examples. ```{r pipe_analyses, warning=FALSE} library(epiphy) library(magrittr) incidence(tomato_tswv$field_1929) %>% split(by = "t") %>% getElement(1) %>% # To keep the first assessment time. clump(unit_size = c(x = 3, y = 3)) %>% fit_two_distr() %T>% plot() %>% summary() ``` For information, below are the same analyses without pipes. ```{r without_pipes, eval=FALSE} my_data <- incidence(tomato_tswv$field_1929) my_data <- split(my_data, by = "t") my_data <- my_data[[1]] my_data <- clump(my_data, unit_size = c(x = 3, y = 3)) my_res <- fit_two_distr(my_data) plot(my_res) summary(my_res) ``` Here is another example: ```{r pipes2, warning=FALSE} count(arthropods) %>% clump(unit_size = c(x = 3, y = 3)) %>% split(by = "t") %>% lapply(agg_index) %T>% (function(x) plot(sapply(x, function(xx) xx$index), type = "b", xlab = "Observation sequence", ylab = "Aggregation index")) %>% sapply(function(x) chisq.test(x)$p.value) ``` ## References