S-Plus for the Analysis of Biological Data

Jones, Rhondda, Gilliver, Robin, Robson, Simon, and Edwards, Will (2014) S-Plus for the Analysis of Biological Data. James Cook University, Townsville, QLD, Australia.


 


Abstract

This manual is designed to teach people to use the statistical software S-Plus and to support the process of learning statistical concepts and methods. It is most useful as a workbook to accompany Whitlock and Schluter's The Analysis of Biological Data, published by Roberts & Company, Colorado. Although we include enough statistical background to put the procedures being demonstrated in context, we assume that readers will be acquiring most of their understanding of statistical concepts elsewhere.

Several of the authors of this manual have been teaching introductory biostatistics to undergraduate and postgraduate students on two campuses in Australia for more than a decade (in fact one of us, who would prefer not to be identified, taught a biostatistics course for the first time more than three decades ago). In 2008 we discovered the textbook The Analysis of Biological Data (referred to in this manual as ABD). We liked everything about the book: its explanations were beautifully clear and aimed at students much like our own; it used a wide variety of real biological examples; it emphasized concepts and procedures important to biologists and explained how they worked; and it introduced some newer computer-intensive techniques that almost all beginning researchers find themselves needing sooner rather than later. We immediately adopted the book as a text for our own introductory biostatistics course. But this adoption acted as a trigger for making some other changes to our teaching—and in particular, to the way we introduced students to statistical software.

To statistical novices, no statistical software is 'user-friendly', and its use needs to be introduced in a structured way which runs in parallel with their acquisition of statistical understanding. At the same time, teaching effort needs to stay focused on statistics rather than software, so that students do not come to see learning to use the software as their primary goal. This manual is intended to allow users to learn to use the software on their own, while keeping a focus on the concepts and procedures which it supports.

We have followed the ABD approach and layout very closely—indeed, we started out with the intention of simply demonstrating in S-Plus every example used in the body of that text. In the end, because everyone has a slightly different view of what should be included in a first statistics course, we added a number of other examples, mostly using our own data, to demonstrate software capabilities that would not otherwise have been covered.

We also did not include material associated with ABD Chapter 20 (Likelihood) or Chapter 21 (Meta-analysis), which are largely conceptual. Most of the computational procedures in Chapter 20 are covered elsewhere in the manual.

Why S-Plus?

There are a lot of statistical software options, and most of them will execute all the procedures needed in an introductory course. In choosing a software package, we had four criteria beyond its ability to execute procedures taught in the course.

* It should have little or no cost to students, and should run on operating systems that students are likely to use on their own machines. Some of us (OK, one of us) remembered teaching statistics in the days when the only computing aid available to students was a hand calculator; the rest of us at least remember being taught that way. While we did not wish to return to those days, they had one huge advantage: students could work on the material anywhere and anytime, and not just in computer laboratories provided by the university. Many of our students are part-time, and some are in remote locations. While we can now reasonably expect that students will have access to a computer at home, we cannot reasonably expect them to buy expensive software for themselves. That meant that if we wanted students to work off-campus, we needed to choose software which was either free or very cheap, or which gave students access on their own machines as part of the university's site licence.

* It should be useful beyond the course. We wanted students to use professional-quality software that they would not 'grow out of': software providing access to all or most of the techniques they were likely to use throughout their careers, and able to import and export data in a wide range of formats (including text files, databases, spreadsheets and other statistical software).

* It should have a very strong graphics capability. We wanted students to realise as quickly as possible that nothing substitutes for an intimate familiarity with the data they are analysing—and easily usable graphics allow the data to be explored more quickly and thoroughly than anything else. We wanted the graphics capability to cover the whole range from quick-and-dirty exploratory plots to presentation and publication-quality graphs.

* It should reinforce the statistical concepts we wanted students to grasp, and not get in the way of learning them. We wanted to avoid both excessive or inappropriate output, and too much difficulty in using the software itself. Excessive output is often a problem with menu-driven software, which may be relatively easy to use, but often provides pages of output that users neither asked for nor know what to do with. Especially for novices, our preference was for software that gives users exactly what they request and offers warnings (or refuses to perform) when that request is questionable. We believe that someone learning to use statistical procedures should also learn to think about what they are doing and work out exactly what it is they want, rather than making guesses about what button to click in the hope that something useful will happen. On the other hand, if software is too difficult to use, students will inevitably concentrate on learning the mechanics of how to use it rather than developing more fundamental understanding.

In the end we chose S-Plus as the best fit to our needs. That choice committed us to producing this manual: there are some excellent introductory books available for S-Plus, but none that we investigated is targeted at undergraduates who begin as complete statistical novices. S-Plus is very powerful and flexible, has superb editable graphics, and its site licence for universities gives enrolled students permission to use the software on their own computers. While it provides a professional-quality graphical user interface (GUI), it also provides a parallel introduction to the command line and to writing basic scripts.

Since we made this decision, a final issue has become more important: the growing use of the open-source statistical software R in the biological research community. Learning the R language as a complete statistical novice is a hard ask for students, who are often having quite enough difficulty with statistical concepts. But S-Plus shares its command language with R, and our experience has been that students make the transition to R quite easily by the end of an introductory course based on S-Plus.

How to use the manual

If you are a student using ABD as a text, and you have access to S-Plus, you can use S-Plus to work through each chapter of the manual independently. Every example is demonstrated in enough detail for you to carry it out on your own after reading the ABD chapter and/or covering the statistical concepts in class. You should execute every example yourself to make sure that you can carry out the procedures correctly and get to the right result. A set of exercises is provided at the end of each chapter for you to test your skills. You should make sure that you can do all those marked essential. You may need assistance from your instructor to complete some exercises, particularly the advanced ones. All the data and scripts required for each chapter are available in the resource material provided with the manual.

The first chapter of the manual is a basic introduction to S-Plus, and is one of a few chapters whose content is not linked to ABD. The second chapter introduces you to S-Plus graphics. The remaining chapters can be covered in several different orders, but you need to work through these two first. Not all the material in later chapters will necessarily be included in an introductory course.

In most chapters, we show how to execute statistical procedures using both the GUI and the command line. A few procedures require the use of scripts (short programs written in the S language). Where this is the case, we provide the scripts in the resource material and reproduce them in Appendix 2, and we show you how to load and use them (but we also encourage you to learn to write your own). In many cases, there are more efficient and elegant ways to write scripts than we have used here; in general, we have tried to produce scripts whose logic can be easily understood by beginners, rather than trying for maximum computational efficiency. Appendix 1 provides a more extensive summary of the S language and S-Plus functions relevant to each chapter of the manual.

To the instructor

We believe that learning statistics is like learning to play the piano: there is no substitute for practice. Consequently, in our own teaching, we provide a lot of incentives for students to practise.

In the introductory course that we teach, we expect students to have worked through the appropriate chapter(s) in the manual and attempted the exercises before they arrive at the relevant practical class or tutorial. The first 20 minutes of each 2-hour practical class includes a simple open-book practical test, marked in class, which requires them to analyse some new data using techniques covered in the chapter. (By the end of the course, most students score marks on most of these tests.) We also run formal (but also open-book) practical exams twice during the course, where the emphasis is on demonstrating that students can make sensible decisions about what to do as well as demonstrating that they can do it. These are also graded immediately. Because students can take this manual, or anything else, into practical tests and exams, we are explicitly not testing how well they remember what buttons to click.

When we first changed to this very assessment-orientated approach to the acquisition of practical skills, one unexpected result was that the average grade on the theory exam at the end of the course (which was in the same format and covered the same material as previously) was significantly higher than that achieved by any previous class; it has remained at this level in subsequent years. Perhaps the development of practical skills really does improve theoretical understanding.

Acknowledgements: As noted above, the structure and content of this manual owe a huge debt to Whitlock and Schluter's text, which provides the best introduction we know of to statistical methods for biology students. We are also very grateful to the students in our 2009 biometrics class, and especially to the practical class tutors (Clwedd Burns, Gavin Coombes, Rie Hagihara, and Philip Newey) whose combined input and feedback improved the manual immensely. Finally, our thanks go to Kris Angelovski of Solution Metrics Pty Ltd, the Australian distributor of S-Plus, for all his help.

Item ID: 33121
Item Type: Book (Teaching Material)
ISBN: 978-0-9875147-0-7
Keywords: SPlus, statistics, experimental design
Additional Information:

©2014. Copyright vests in the authors.

This manual was first produced in 2009.

Data sets are available from the authors.

Book contains chapters 1-19. Supplemental chapters are as follows:

Chapter 20: Introduction to generalized linear models.

Chapter 21: Generalized linear models: analysis of count data via Poisson regression.

Chapter 22: Generalized linear models: analysis of aggregated data via negative binomial regression.

Chapter 23: Generalized linear models: modelling changes in proportions with logistic regression.

Date Deposited: 02 Dec 2014 04:23
FoR Codes: 01 MATHEMATICAL SCIENCES > 0104 Statistics > 010402 Biostatistics @ 100%
SEO Codes: 93 EDUCATION AND TRAINING > 9302 Teaching and Instruction > 930203 Teaching and Instruction Technologies @ 50%
97 EXPANDING KNOWLEDGE > 970106 Expanding Knowledge in the Biological Sciences @ 50%