A data management system for electrophysiological data analysis

June 2, 2017 | Author: Alexander Ecker | Category: Data Analysis, Data management system


Max-Planck-Institut für biologische Kybernetik
Max Planck Institute for Biological Cybernetics

Project Report

A Data Management System for Electrophysiological Data Analysis

Alexander S. Ecker1,∗, Philipp Berens1,2, Andreas S. Tolias1,3

August 2007

1 Physiology of Cognitive Processes, Department Logothetis
2 Computational Vision and Neuroscience, Group Bethge
3 Baylor College of Medicine, Department of Neuroscience, Houston, TX
∗ email: [email protected]

1 Introduction

Recent advances in both electrophysiological recording techniques and hardware capabilities have enabled researchers to record simultaneously from a large number of neurons in different areas of the brain (Buzsaki, 2004, for review). Recently, we have demonstrated that it is possible to monitor the activity of such ensembles of neurons in the awake primate for many days or even weeks (Tolias et al., 2007). While this makes possible a wide range of exciting and complex analyses, potentially leading to a better understanding of the principles underlying neural network computations, it also poses additional challenges for data handling and management. As the amount and complexity of the data increase, significantly more emphasis and diligence have to be put on the data analysis task. Although high-level scripting languages such as Matlab can speed up the development of analysis tools, in our experience too much time is still spent on (re)structuring and (re)organizing data for specific analyses. This is not only time consuming but also error-prone, and some errors might go unnoticed by researchers and reviewers and make their way into scientific publications. We therefore believe it is mandatory to solve these problems in a more principled manner than on a day-to-day basis.

In this report, we describe a newly developed data management system, specifically designed to address the daily needs of neurophysiological experimenters in an active and dynamic laboratory setting (Ecker et al., 2007). It supplies the user with basic data types and functions to organize and structure various types of electrophysiological data. Although it is implemented in Matlab, a high-level scripting language familiar to many neuroscientists, our system provides full flexibility, platform independence and extensibility.
By using an object-oriented, hierarchical layout, basic functionality, such as integration of meta data or storage and retrieval of data and results, is implemented independently of specific data formats or experimental designs. This makes our framework easily adaptable to future experiments and new data formats from new recording hardware. All data and experimental results are stored in a database, so the experimenter can choose which data to keep in memory for faster access and which to save to disk to save resources.

While several projects have pursued closely related goals over the last several years, none of them has succeeded in developing a system adopted by a large audience of researchers. We believe that this apparent failure is due not only to experimentalists being unwilling to adopt new software packages, but also to a discrepancy between the goals and means of the projects. Commercial products suffer from poor flexibility and adaptability, because their source code is not publicly available, and from their usually high pricing (Nex Technologies, 2007, for an example). Among the most notable large-scale open source projects in this field is the work by Robert et al. (2003, 2004), aimed at designing a web-based interface for data sharing in neuroscience. While this project considers, compared to our goals, the even wider scope of sharing data between laboratories, it illustrates the pitfalls of such an endeavor: to use the database system, detailed knowledge of a special description language called BrainML is mandatory, and the overall implementation is very technical, non-intuitive and in itself quite static. Only 11 dataset submissions had been made by July 2007. Other software packages such as MEA tools use an easily accessible implementation scheme, but fail to provide a principled and efficient way of storing and managing the data resources (Egert et al., 2002). Rather, they supply a wealth of preimplemented analysis functionality.
DATA-MEAns and neurALC likewise focus on the development of a graphical user interface with adaptable plug-ins, distributed in precompiled form (Bonomini et al., 2005; Berenguer and Bongard, 2006). While providing researchers with some analysis functionality may be beneficial for inexperienced users, it also carries some dangers: without full control over the steps taken in an analysis and a proper understanding of the underlying methodology, researchers may misinterpret results.

[Figure 1 (diagram): three linked blocks. Data analysis: fixation, stimulus, spike density function, orientation tuning, power spectrum, ... Data organization: structure data; storage & retrieval of analysis results; easy access to collected data. Collected data: spike sorting, waveforms, local field potentials, ...]

Figure 1: Data organization provides the connection between data collected from neurophysiological experiments and data analysis. It provides means for efficient and flexible data analysis. It should be extensible while yet providing a clean interface for data integration.

Here we describe a system that facilitates efficient and robust development of analysis tools rather than providing ready-to-use analysis programs. We designed and implemented a data management system that integrates easily with various recording technologies and hardware configurations. The software layout is chosen to reflect our basic intuitions about how data is structured and accessed. Furthermore, it is designed to avoid the problems mentioned above. We describe the main components of our system in detail in section 3 and give some implementation details in section 4. In the development process we sought the ongoing feedback of practicing experimentalists working in our laboratory. We believe that this will make our system highly usable by researchers.

2 System design

2.1 System goals

The ability of neurophysiologists to record ever increasing amounts of data provides new opportunities as well as new challenges for testing and investigating hypotheses about brain function. Here we are concerned with the problem of managing data, i.e. how to prepare and store data in a way that makes it as easy as possible for researchers to access, analyze, and handle it. We argue that by separating data organization and data analysis, the development process can be made much more efficient and less error-prone while still providing researchers with maximum flexibility.

Data organization is challenging since the amount of data can easily grow to several hundreds of gigabytes that need to be accessed quickly under the pressure of limited resources such as working memory, network speed and processing power. Furthermore, the data can come from a multitude of sources. During a neurophysiological experiment, in addition to single neurons, other signals such as local field potentials or eye movements are recorded, each of which has to be handled in its own way. A data management system should be able to integrate these data and provide them to an experimenter on demand, thereby taking care of several processing steps that are to date often done manually and are a potential source of errors. Additionally, results of certain basic analyses may serve as the basis for further investigations. Being able to automatically store and later retrieve such results rather than recomputing them every time is of great help in daily work. Figure 1 illustrates the concepts just described.

Data organization provides the connection between data and analysis, thereby enabling the latter to be as efficient and error-free as possible. While solving all these issues is important, our data management system is designed for a maximally intuitive handling experience, and its main goal is to make life easier for an experimental neuroscientist. Therefore, we also bear in mind that laboratories are highly interactive environments and that several users might need access to the data at once. In addition, ideas and hypotheses often need to be refined, and so do the analysis programs used for testing them. As new techniques are developed, new signal sources need to be handled by the data management system. While working with clear specifications and interfaces that all data have to fulfill, it is mandatory to design a system that allows flexible access on a daily basis as well as simple integration of new data types.

2.2 System Design

As shown in figure 1, we believe that separating data analysis from data storage in a principled way will lead to an improved workflow. Providing the link between data analysis and recorded data, the data management system has two naturally defined interfaces: data storage, where it links with recorded and preprocessed data, and data access, where it is in touch with the user. By clearly specifying these two interfaces we achieve two goals: First, we are able to provide the user with a uniform interface for data access that does not change even if the underlying storage system is modified. Second, new data sources and file formats can be easily integrated into the system (provided some functions that process the raw data to fulfill our specifications) without the need to change any higher-level features of analysis programs, code in the core engines of the data management system, or the like.[1]

These specifications must be fulfilled for every type of data. However, each data type needs some special handling or processing to be used properly. Both requirements can be met by an object-oriented framework in which all data types are derived from the same abstract class. The specifications are thereby enforced, while room is left to implement any data-specific additional functionality. This also leads to a flexible system, because adding a new type of data source just requires a new object derived from a more general class, bringing with it the proper functionality to import and process the data. All other data types can be left untouched and no programs have to be changed. This principle will recur many times in the description of the system's design and implementation.

3 System Structure

In this section we describe the core structure of our data management system. First, we introduce Elements and Data objects, the building blocks of our proposed framework. Then, we turn to Context objects, which can be used to efficiently organize and access data in an intuitive way. Last, we describe the DataContainer object, which is used to store active Elements and Data objects and implements the interaction with the storage engine.

3.1 Elements

An Element is the basic unit our system is built around. Intuitively, it can be thought of as something that has data attached to it and analysis performed on it. Examples include electrodes, neurons, single units and local field potentials. While this may seem to be a loosely defined concept, all of the mentioned examples are associated with one or more data types. In the case of a single unit, this might be spike times and spike waveforms as well as additional data obtained by analyzing various aspects of the recorded neural activity. Furthermore, Elements share the notion that they might be associated with additional information about them, which we call meta data. For an electrode, this might include the gain of the pre-amplifier, the electrode's position in the brain, its material, or its impedance. All of this information is potentially useful, but differs from recorded data in that it either specifies information needed for certain types of analysis (e.g. the pre-amplifier gain if voltages are to be reconstructed from digitized values) or groups Elements together (e.g. the material).

Elements are defined by a unique identifier, a list of meta data and a list of Data objects that belong to the Element (see figure 2). An Element can be viewed as a wrapper around Data objects, providing additional information and binding together Data objects of different kinds: spike waveforms and spike times of a single unit are two different types of data, both derived from the same physical object. Therefore we believe it useful to integrate these two by attaching them to a common object, the abstract representation of a single neuron. This object then manages the relationship of the data to other kinds of data and provides information relevant to both data types, such as the isolation quality of the neuron.

Because all the seemingly different real objects, such as electrodes and neurons, are abstracted to Elements, adding new Element types is not a problem. By specifying the kind of interactions an Element has to respond to in a meaningful and defined way, we make it possible to extend our system with nothing but local modifications.

[Figure 2 (diagram): an Element with the fields id (globally unique identifier), metadata (property list of name/value pairs) and data (any number of Data objects D1 ... Dn); a Data object with the fields element_id (id of the Element it is attached to), properties (property list of name/value pairs) and data (actual data content).]

Figure 2: Element and Data objects are the basic data types of our data management system. Elements have data attached to them and analysis is performed on them. They are identified by a unique identification number and may contain additional meta information. Data objects contain raw or preprocessed data. They are attached to Elements. For further details, see text.

[1] A related advantage is that the system is independent of any specific form of low-level storage engine, such as a relational database.

3.2 Data

Data objects contain actual data that has been collected or preprocessed, or analysis results obtained from other Data objects. Examples include spike times and spike waveforms as obtained from single neurons, or voltage traces such as the local field potential. Each Data object belongs to an Element and is linked to it via the Element's id. Data objects can be parameterized in order to account for different ways of preprocessing or for parameters used during analysis. For example, voltage traces might be resampled to a lower sampling rate or band-limited to a certain frequency range by digital filtering. Figure 2 summarizes the contents of a Data object.

Basic functionality such as storage for later use and retrieval from the database is implemented by the abstract parent class. Each Data object only has to implement a preprocessing function to import the data into the system and a function to access its contents. This way, new data types can be easily created without affecting the remaining parts of the system at all. In addition, Data objects can add extra functionality such as selecting certain subsets, restructuring, or reordering their contents given specific constraints.
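The relation between Elements and Data objects can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Matlab implementation described in this report; all class, method and field names (`Data`, `SpikeTimes`, `import_raw`, `key`, `in_window`, the `unit` property) are hypothetical and chosen only to mirror the fields listed in figure 2.

```python
from abc import ABC, abstractmethod


class Data(ABC):
    """Abstract parent: shared fields plus the two functions every type must supply."""

    def __init__(self, element_id, **properties):
        self.element_id = element_id   # id of the Element this object is attached to
        self.properties = properties   # preprocessing / analysis parameters
        self.content = None            # actual data, filled in by import_raw

    def key(self):
        """Identify the object by owner, type and parameters (assumed scheme)."""
        return (self.element_id, type(self).__name__,
                tuple(sorted(self.properties.items())))

    @abstractmethod
    def import_raw(self, raw):
        """Preprocess raw input and store it in self.content."""

    @abstractmethod
    def get(self):
        """Return the stored contents."""


class SpikeTimes(Data):
    """One concrete data type; other types would be added the same way."""

    def import_raw(self, raw):
        self.content = sorted(float(t) for t in raw)

    def get(self):
        return list(self.content)

    def in_window(self, start, stop):
        """Type-specific extra: select the spikes with start <= t < stop."""
        return [t for t in self.content if start <= t < stop]


class Element:
    """Unique id, meta data, and the Data objects attached to it."""

    def __init__(self, element_id, **metadata):
        self.id = element_id
        self.metadata = metadata
        self.data = {}                 # Data objects keyed by their class name

    def attach(self, data_obj):
        if data_obj.element_id != self.id:
            raise ValueError("Data object belongs to a different Element")
        self.data[type(data_obj).__name__] = data_obj


unit = Element(4711, number=3)
spikes = SpikeTimes(unit.id, unit="s")
spikes.import_raw([0.30, 0.10, 0.25])
unit.attach(spikes)
print(unit.data["SpikeTimes"].in_window(0.0, 0.27))  # [0.1, 0.25]
```

Because the parameters are part of the key in this sketch, two differently preprocessed versions of the same signal would be treated as distinct Data objects of the same Element.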

3.3 Contexts

Context objects establish relationships between Elements and structure the data that is available in the system. Figure 3 (left part) presents an intuitive way of structuring data. Usually, experiments are conducted in several experimental sessions. During each of these sessions, a certain number of electrodes is used to record neural activity. Each of these signals can be used as multi unit activity and to extract potentially multiple single units. This imposes a hierarchical, tree-like structure, which is commonly used to organize the collected data.

[Figure 3 (diagram): a session tree (Session1 ... Sessionn, each containing Tetrode1 ... Tetroden, each with a MultiUnit and SingleUnit1, SingleUnit2, ...) and a second set of links grouping SingleUnits from different sessions into stable neurons (Neuron1, Neuron2, Neuron3, ...). An example SingleUnit Element is shown with id 4711, metadata "number → 3", and data: spike times, spike waveforms, orientation tuning.]

Figure 3: Elements are organized in Contexts. These are flexible graph-like structures that organize the relations between different elements. Elements can be part of many Contexts. For further details, see text.

This intuitive tree structure rests on the implicit assumption that experimental sessions are more or less independent of one another and that their elements do not relate to each other in any specific way. If, however, as demonstrated by Tolias et al. (2007), the same neurons can be recorded across multiple days, the single units of each session need to be identified with a unique stable neuron, which is a nontrivial task if the structure is "hardcoded" into the system.

Therefore, our system does not have any a priori structure among Elements. Context objects can be created to relate Elements to each other in almost arbitrary ways. A Context is a graph that uses a set of Elements as its nodes and links them together via edges. Additionally, it provides various functions to access Elements or to pass along and retrieve information. For instance, the SessionContext (figure 3, red subgraph) can be used to determine on what day a given single unit was recorded, or how many other units were recorded that day, by querying for an Element's parent or descendants. This way the amount of redundant information can be minimized. In contrast, to find out which single units on the preceding or following days represent the same neuron in the brain, a StableNeuronContext would be needed (figure 3, blue subgraph). Again, arbitrary new relations between Elements can be constructed by creating new Context objects, without changing the remaining parts of the system at all.
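The idea that one set of Elements can be linked by several independent graphs can be sketched as follows. This is a Python illustration under simplifying assumptions, not the authors' Matlab code: Elements are represented only by string ids, edges run from parent to child, and the method names (`link`, `parent_of`, `root_of`, `descendants`, `leaves`) are hypothetical.

```python
class Context:
    """Directed graph over Element ids; edges point from parent to child."""

    def __init__(self, name):
        self.name = name
        self.children = {}   # parent id -> list of child ids
        self.parent = {}     # child id -> parent id

    def link(self, parent_id, child_id):
        self.children.setdefault(parent_id, []).append(child_id)
        self.parent[child_id] = parent_id

    def parent_of(self, element_id):
        return self.parent.get(element_id)

    def root_of(self, element_id):
        """Follow parent links until an Element without a parent is reached."""
        node = element_id
        while node in self.parent:
            node = self.parent[node]
        return node

    def descendants(self, element_id):
        """All Elements reachable from element_id (depth-first, order not guaranteed)."""
        found = []
        stack = list(self.children.get(element_id, []))
        while stack:
            node = stack.pop()
            found.append(node)
            stack.extend(self.children.get(node, []))
        return found

    def leaves(self, element_id):
        """Descendants that have no children of their own."""
        return [n for n in self.descendants(element_id) if n not in self.children]


# Session hierarchy: session -> tetrode -> single unit
sessions = Context("SessionContext")
sessions.link("S1", "S1/T1")
sessions.link("S1/T1", "S1/T1/U1")
sessions.link("S1/T1", "S1/T1/U2")
sessions.link("S2", "S2/T1")
sessions.link("S2/T1", "S2/T1/U1")

# A second, independent graph over the same Elements: units grouped by neuron
stable = Context("StableNeuronContext")
stable.link("Neuron1", "S1/T1/U1")
stable.link("Neuron1", "S2/T1/U1")

print(sessions.root_of("S1/T1/U2"))      # S1
print(len(sessions.leaves("S1")))        # 2
print(stable.children["Neuron1"])        # ['S1/T1/U1', 'S2/T1/U1']
```

The unit `S1/T1/U1` takes part in both graphs, and adding the second Context required no change to the first.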

3.4 Database

The database is used to store all Elements, Data objects, and Contexts (in the form of directed edges). A clean interface specifies how to insert, retrieve, and delete objects. Our current implementation uses the free, open source MySQL database as a storage engine, but other storage engines are straightforward to implement.
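A storage interface of this kind might look like the following Python sketch. It is an illustration only: the report does not spell out the interface, so the method names and the key scheme are assumptions, and SQLite (from the Python standard library) stands in for MySQL so that the example is self-contained.

```python
import pickle
import sqlite3
from abc import ABC, abstractmethod


class Database(ABC):
    """The three operations every storage engine has to provide."""

    @abstractmethod
    def insert(self, key, obj):
        """Store obj under key, replacing any earlier entry."""

    @abstractmethod
    def retrieve(self, key):
        """Return the object stored under key; raise KeyError if absent."""

    @abstractmethod
    def delete(self, key):
        """Remove the entry stored under key."""


class DictDatabase(Database):
    """In-memory engine, e.g. for tests."""

    def __init__(self):
        self._rows = {}

    def insert(self, key, obj):
        self._rows[key] = pickle.dumps(obj)

    def retrieve(self, key):
        return pickle.loads(self._rows[key])

    def delete(self, key):
        self._rows.pop(key, None)


class SqliteDatabase(Database):
    """SQL-backed engine; callers never see the SQL statements."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS objects "
            "(obj_key TEXT PRIMARY KEY, payload BLOB)")

    def insert(self, key, obj):
        self._conn.execute(
            "INSERT OR REPLACE INTO objects (obj_key, payload) VALUES (?, ?)",
            (key, pickle.dumps(obj)))
        self._conn.commit()

    def retrieve(self, key):
        row = self._conn.execute(
            "SELECT payload FROM objects WHERE obj_key = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return pickle.loads(row[0])

    def delete(self, key):
        self._conn.execute("DELETE FROM objects WHERE obj_key = ?", (key,))
        self._conn.commit()


# The calling code is identical for both engines.
for db in (DictDatabase(), SqliteDatabase()):
    db.insert("element/4711", {"id": 4711, "metadata": {"number": 3}})
    print(db.retrieve("element/4711")["metadata"]["number"])  # 3
    db.delete("element/4711")
```

Swapping the engine changes only which class is instantiated, which is the property the interface is meant to guarantee.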

3.5 DataContainer

The DataContainer binds all previously discussed objects together. All Elements, Data objects and Contexts as well as the Database object are stored within and accessed through it. It provides functionality to add, load, and access other objects. In addition, it interfaces the structures accessed by the user with the storage engine.

4 Implementation

In this section we briefly describe some relevant implementation details and discuss problems arising due to the very limited use of special software and programming technologies.

4.1 Programming language and software

As we elaborated above, the system is intended to serve as a basis for neuroscientists to develop their own custom-tailored tools to analyze complex electrophysiological data. Therefore, we decided to implement the system in Matlab and to use as few additional programming languages or software packages as possible. Whenever we did use additional technologies, we tried to abstract them away such that the user does not have to interact with them at all. These decisions were made for two simple reasons: First, Matlab is relatively easy to learn and used throughout most laboratories, making it the quasi-standard programming language for such a task. Second, many neuroscientists have backgrounds in biology or psychology and have only limited programming skills.

All objects described in the previous section are implemented as Matlab classes. Since Matlab is a scripting language primarily intended for prototyping and numerical computations, its object-oriented programming capabilities have been added only quite recently and do not provide all features of modern object-oriented programming languages. Most notable are its lack of static functions and the fact that all function arguments are passed by value. As a consequence, all objects that are passed to a function and potentially modified have to be returned by the function as return values. Also, objects that are retrieved from containers (such as the DataContainer) and modified have to be written back to the container afterwards in order to make the changes permanent. This can be cumbersome at times and is a potential source of errors. However, it is not a serious problem, since one gets used to this style of programming very quickly and errors of this kind are not hard to debug. Also, the lack of static functions mentioned above is easy to work around by creating a "dummy" object without any actual content to issue the function call.

The free, open source MySQL database is used as the storage engine and is accessed through a modified version of the mYm MySQL wrapper functions for Matlab (Maret, 2007). Since database communication is hidden in the Database class, the user does not have to interact directly with the MySQL database or write SQL queries.
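The write-back pattern that follows from pass-by-value semantics can be made concrete with a small emulation. Python passes object references, so the sketch below copies objects explicitly to imitate the behaviour described above; it is an analogy, not Matlab code, and `ValueContainer` and `set_gain` are hypothetical names.

```python
import copy


class ValueContainer:
    """Container that hands out and stores copies, imitating value semantics."""

    def __init__(self):
        self._items = {}

    def put(self, key, obj):
        self._items[key] = copy.deepcopy(obj)

    def get(self, key):
        return copy.deepcopy(self._items[key])


def set_gain(element, gain):
    """Modify a private copy and return it; the caller's object is untouched."""
    element = copy.deepcopy(element)      # imitate pass-by-value
    element["metadata"]["gain"] = gain
    return element


container = ValueContainer()
container.put("tetrode1", {"metadata": {}})

tetrode = container.get("tetrode1")
tetrode = set_gain(tetrode, 1000)         # the return value must be captured

# Until the object is written back, the container still holds the old state.
visible_before = "gain" in container.get("tetrode1")["metadata"]
container.put("tetrode1", tetrode)        # write back to make the change permanent
visible_after = "gain" in container.get("tetrode1")["metadata"]

print(visible_before, visible_after)      # False True
```

Forgetting either the assignment of the return value or the final `put` silently discards the change, which is the kind of error the text refers to.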

4.2 Setting up the structure

So far we have only described how data is imported into the system, stored and retrieved, as well as how structure and meta data are represented. However, a practical issue to be considered is how Elements are created and how meta data and information about the structure get into the system in the first place. Although creating Elements, entering meta data, and building links manually might be feasible for small amounts of data, this approach soon becomes very tedious and at some point impracticable. Furthermore, Elements have to be uniquely identifiable in order to prevent multiple entries of the same Element in the database.

Therefore, we decided on the following scheme: each type of Element is created by exactly one type of Context. This constraint is necessary for reasons that become apparent when considering the problem of uniquely identifying Elements. An isolated Element alone does not provide enough information to do so. For instance, the identity of a single unit is established by knowing the electrode it was recorded from and the date of the recording. This information is contained in the SessionContext and, hence, a SingleUnit object can only be created by a SessionContext. In our implementation, identity is encoded in a hash value that is computed from an Element's class name, its meta data and the hash values of adjacent Elements in the given Context.

A Context is created for the first time by calling its import function, which creates all Elements and links. In our implementation, the SessionContext is the basic Context which creates Sessions, Tetrodes, SingleUnits, MultiUnits, and FieldPotentials. A list of which sessions are available, which tetrodes were used, and the associated meta data is supplied to the SessionContext in an easily writable XML document. Elements and Contexts are automatically stored in the database once imported so that they can be easily loaded later.
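The identity scheme can be sketched in a few lines of Python. The report does not state which hash function or serialization is used, so SHA-1 over a JSON encoding is an assumption here, as are the function name `element_hash`, the choice of the parent as the only adjacent Element, and the example dates and numbers.

```python
import hashlib
import json


def element_hash(class_name, metadata, adjacent_hashes=()):
    """Hash of class name, meta data and the hashes of adjacent Elements."""
    payload = json.dumps(
        {"class": class_name,
         "meta": metadata,
         "adjacent": sorted(adjacent_hashes)},
        sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Two sessions on different (made-up) dates
session_a = element_hash("Session", {"date": "2007-03-01"})
session_b = element_hash("Session", {"date": "2007-03-02"})

# The same tetrode number in each session
tetrode_a = element_hash("Tetrode", {"number": 1}, [session_a])
tetrode_b = element_hash("Tetrode", {"number": 1}, [session_b])

# Single units with identical meta data but different surroundings
unit_a = element_hash("SingleUnit", {"number": 3}, [tetrode_a])
unit_b = element_hash("SingleUnit", {"number": 3}, [tetrode_b])
unit_a_again = element_hash("SingleUnit", {"number": 3}, [tetrode_a])

print(unit_a == unit_b)        # False: same meta data, different session
print(unit_a == unit_a_again)  # True: re-importing yields the same identity

# Using the hash as a key prevents duplicate entries on repeated import.
registry = {}
for h in (unit_a, unit_b, unit_a_again):
    registry.setdefault(h, "SingleUnit")
print(len(registry))           # 2
```

The example also shows why an isolated Element is not enough: without the tetrode and session hashes, `unit_a` and `unit_b` would be indistinguishable.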

5 Discussion

In this report, we described the design, development, and implementation of a data management system aimed at facilitating neurophysiological data analysis. By making a clear distinction between data storage and data analysis we were able to create a clean interface between the two that takes care of all necessary interactions between user and data. Furthermore, we took great care to make using it an intuitive experience and adapted the system to the needs of experimental neurophysiologists. We achieved extensibility by using an object-oriented software design.

As such a system is very complex, even extensive tests with surrogate data can never reveal all possible sources of errors. Therefore, we plan to subject it to a phase of real-life testing. As a pilot, we will use it in two half-year projects performed at Baylor College of Medicine in Houston, TX. We hope that these projects will reveal possible faults and highlight possibilities for improvement. Once it has been completed and improved, the data management system described in this report will be introduced to active groups of neuroscientists at the Max Planck Institute for Biological Cybernetics in Tübingen and at Baylor College of Medicine in Houston, TX. Using their feedback, we plan to make the software publicly available by mid 2008. Currently, it is planned to be released under the GPL2 as open source software.

Acknowledgments

We thank the MFG Stiftung Baden-Württemberg for their financial support through the Karl Steinbuch Scholarship to Alexander Ecker and Philipp Berens and the Max Planck Society for hosting us. In addition, we would like to thank Georgios Keliris and James Cotton for stimulating discussions and feedback.

References

V. Berenguer and M. Bongard. neurALC. Website, 2006. URL http://neuralc.sourceforge.net/.

M. P. Bonomini, J. M. Ferrandez, J. A. Bolea, and E. Fernandez. DATA-MEAns: an open source tool for the classification and management of neural ensemble recordings. J Neurosci Methods, 148(2):137–146, Oct 2005.

G. Buzsaki. Large-scale recording of neuronal ensembles. Nat Neurosci, 7(5):446–451, May 2004.

A. S. Ecker, P. Berens, G. A. Keliris, N. K. Logothetis, and A. S. Tolias. A data management system for electrophysiological data analysis. In Proceedings of the 7th Meeting of the German Neuroscience Society, pages T38–5C, 2007.

U. Egert, T. Knott, C. Schwarz, M. Nawrot, A. Brandt, S. Rotter, and M. Diesmann. MEA-Tools: an open source toolbox for the analysis of multi-electrode data with matlab. J Neurosci Methods, 117(1):33–42, May 2002.

Y. Maret. mYm. MySQL wrapper for Matlab. Website, 2007. URL http://sourceforge.net/projects/mym/.

Nex Technologies. Neuroexplorer. Website, 2007. URL http://www.neuroexplorer.com/.

A. Robert, M. Abato, K. H. Knuth, and D. Gardner. Neuroscience data sharing I: Interfaces, incentives and internals for interoperability. In Society for Neuroscience Annual Meeting, 2003.

A. Robert, A. Jagdale, D. H. Goldberg, and D. Gardner. Human brain project resources enabling data discovery for neuroscience. In Society for Neuroscience Annual Meeting, 2004.

A. S. Tolias, A. S. Ecker, A. Siapas, A. Hoenselaar, G. A. Keliris, and N. K. Logothetis. Recording chronically from the same neurons in awake, behaving primates. Journal of Neurophysiology, Under Review, 2007.
