3) Organization
General notes
Research data is valuable for researchers and forms the basis for their research. Therefore, it is advisable to structure the data well to save time and effort in the daily handling of research data. In this part of the workshop, we will look closer at organizational aspects of data management, mainly the folder structure, file and folder naming, and file formats.
A clear and consistent folder structure and folder and file naming convention are important for making your data findable and interoperable. You should think about it beforehand in order to avoid inconsistencies or the need to rename large amounts of data.
Your structure and your naming conventions should be intuitive. However, we recommend describing them explicitly (typically in a README file), because they may not be as intuitive for others or for your future self (“why did I do it like this?”).
In the following sections, you’ll find some input on the organizational aspects you should consider. Note that not all of them may apply to each dataset. Besides the tasks, we’ll provide some general hints and rules. Some rules only apply to some use cases, and sometimes, there are good arguments for not sticking to every rule. However, in such cases you should know (and potentially document) why you decide differently.
Folder structure
At the start of your research project, you have to decide how to arrange your files and folders. This decision depends on the structure of your data and documentation. Organizational choices may involve trade-offs, such as the number of files per folder versus folder depth, intuitive names versus strict naming conventions, and structuring by processing level, access permissions, file size, or other criteria.
Depending on the operating system, the total path length has an upper limit, e.g. 255 characters. Exceeding this limit will cause errors. Also note that if you synchronize or back up your data, the path of the copy may be even longer than your original path, which can cause the sync or backup job to fail. Therefore, try to keep your full path well below such limits.
- Bad example:
X:/Projects/Microscopy_Project/Microscopy_Projects_2024/October_2024/RawData_October2024/Microscopy_RawData_Image003.tif
- Better:
X:/Projects/Microscopy/2024-10/RawData/Image003.tif
- Avoid deeply nested folder structures: SubSubSubSubSubFolders can be pretty inconvenient.
- Avoid too many files or subfolders within one folder: It can be quite inconvenient to look through dozens of heterogeneous file names. In case of clearly structured file names (e.g. numbered files like Image003.tif or Plot01_Part03.tab), a larger number of elements per folder can also be fine. However, for huge amounts of files (several thousand), the performance of the file explorer may decrease.
- In case different project members should have different access restrictions to files, this could also be considered in your folder structure.
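The path-length advice above can be checked automatically. The following is a minimal sketch, not a definitive tool: the limit of 255 characters is taken from the example above, while the `headroom` of 40 characters (room for a longer sync or backup destination) and the function names are assumptions made for illustration.

```python
from pathlib import Path


def path_too_long(path: str, limit: int = 255, headroom: int = 40) -> bool:
    """Return True if the path, plus headroom for a longer copy, exceeds the limit."""
    return len(path) + headroom > limit


def find_long_paths(root: str, limit: int = 255, headroom: int = 40) -> list[str]:
    """Walk a folder tree and collect every path that comes too close to the limit."""
    candidates = (str(p.resolve()) for p in Path(root).rglob("*"))
    return [p for p in candidates if path_too_long(p, limit, headroom)]


# The two example paths from the list above.
bad = ("X:/Projects/Microscopy_Project/Microscopy_Projects_2024/"
       "October_2024/RawData_October2024/Microscopy_RawData_Image003.tif")
better = "X:/Projects/Microscopy/2024-10/RawData/Image003.tif"

for example in (bad, better):
    print(len(example), "characters:", example)
```

Calling `find_long_paths(".")` in a project folder would list the paths that deserve shortening. The actual limit depends on your operating system and tools, so adjust `limit` accordingly.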
Look at the folder structure (but not at the details of the folder or file names yet, which will be the next task).
- Is the folder structure intuitive and logical (what is done, how, and why)?
- Is it explicitly described? Where can you find this information (metadata of repository or in a README file)?
- Discuss: What would you leave as it is, what would you change, or what are the alternatives?
In case there are no folders, you may discuss whether it would make sense to add folders.
The dataset has 42 files, but no folder structure. Folders are not needed here, because all files (except for the README file) are of same type, just for different months. However, one could make one subfolder per year.
The dataset contains 6 files, without folder structure. However, 2 of them are of type ‘tar.gz’, which contain compressed ASCII files. The content is described in the README file. Also, the tar.gz files do not contain many files, thus no further folder structure is needed.
The following notes relate to the content of OSF Storage.
- Yes, files are grouped into data, code (scripts) etc.
- The content is described in the README file, but not completely.
On dataset-level, there is only one zip file, no folders. However, within the zip file, there is a folder structure:
- Yes, intuitive structure: separation between data tables and scripts, …
- Explicitly described in README file.
File and folder names
In the next section, we will explore best practices for file and folder naming to create a clear and organized data structure. File or folder names have the following primary purposes:
- Always: Uniquely identify the file or folder (within a folder),
- Often: Give information about its content, e.g. README.txt, MeetingProtocol.docx, Temperature_RawData.tab,
- Sometimes: Enable logical order when sorting alphabetically, e.g. 1_RawData, 2_PreProcessed, 3_Processed, 4_Combined.
Generally, the same rules apply to the naming of folders and files (except for the file extension after the dot, e.g. “README.txt”). A name should make it possible to pick the desired file from among all the other files in the folder. Therefore, names should be concise and intuitive (if applicable). For instance, the purpose of a file named XYZ123 might not be immediately clear, so it’s important to explain it somewhere, typically in a README file. Well-structured folders have clear naming conventions, which are explicitly described.
Depending on the operating system and application, some characters are forbidden or may lead to problems and, thus, should be avoided.
- Very bad: Any non-ASCII character, e.g. öäüßµαδ°±•€→☺É
- Bad: Any whitespace character, e.g. File 1.txt. They can cause problems, e.g., in some batch tasks, in particular, if one forgets to surround the name with quotes. Furthermore, double or multiple spaces and spaces at the beginning of the name are not clearly visible.
- Forbidden in Windows: \/:*?"<>|
- Also not recommended: ,;()[]{} etc.
To summarize: You should only use Latin letters A-Z, a-z, digits 0-9, underscore, hyphen and dot, i.e. the following characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_-.
Furthermore, the dot should only be used in file names, and there only once before the file extension, e.g. Notes.txt. Some programs use a dot or underscore as the first character for special file types, e.g. _quarto.yml or .git; a leading dot or underscore should thus be avoided for regular data files.
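The character rules above can be expressed as one regular expression. This is a sketch under stated assumptions rather than an official rule set: the pattern and the function name `is_safe_name` are illustrative, and requiring a letter or digit as the first character is slightly stricter than the text (it also rejects a leading hyphen).

```python
import re

# First character: letter or digit (so no leading dot, underscore or hyphen).
# Then letters, digits, underscore or hyphen, optionally followed by
# exactly one dot and a file extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*(\.[A-Za-z0-9]+)?")


def is_safe_name(name: str) -> bool:
    """Return True if a file or folder name uses only the recommended characters."""
    return SAFE_NAME.fullmatch(name) is not None


for name in ["Notes.txt", "1_RawData", "File 1.txt", ".git", "wc4.4_cal6.0.csv"]:
    print(f"{name!r}: {'ok' if is_safe_name(name) else 'problematic'}")
```

Names with spaces, non-ASCII characters, several dots, or a leading dot or underscore are reported as problematic. Special files that a program requires (such as _quarto.yml) are of course legitimate exceptions.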
Ensure that subfolders and files have unique names within a folder, even when upper and lower case are ignored. For example, do not put two files named hello.txt and Hello.txt in the same folder.
This note is particularly relevant for Linux users, where putting both files in the same folder is possible. However, in Windows, that is not allowed. Thus, sharing such a folder between users of different operating systems would cause problems.
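Such clashes are easy to overlook on a case-sensitive file system. As a small sketch (the function name `case_collisions` is made up for this example), names can be grouped by their case-folded form to reveal entries that would collide on a case-insensitive system:

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that differ only in upper/lower case; return the clashing groups."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


print(case_collisions(["hello.txt", "Hello.txt", "README.txt"]))
# [['Hello.txt', 'hello.txt']]
```

To check a real folder, the list of names could come from `os.listdir(folder)`.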
A naming convention can enable a logical order of the file or folder names when sorting them alphabetically. Here, we provide some tips:
- When names include numbers, leading zeros are often helpful:
  - Ordering with “0”: Scan01.csv, Scan02.csv, Scan03.csv, Scan04.csv, Scan05.csv, Scan06.csv, Scan07.csv, Scan08.csv, Scan09.csv, Scan10.csv, Scan11.csv, Scan12.csv
  - Ordering without: Scan1.csv, Scan10.csv, Scan11.csv, Scan12.csv, Scan2.csv, Scan3.csv, Scan4.csv, Scan5.csv, Scan6.csv, Scan7.csv, Scan8.csv, Scan9.csv
- Timestamps should always be given with a leading zero and ‘from big to small’, i.e. year, month, day of month, hour, minute, second. This recommendation complies with the international format ISO 8601 (e.g. “2024-07-31”, “2024-07-31T2313”).
  - Very bad: 13Jan2024, 21April2021, 3Dec2025
  - Also bad: 03122025, 13012024, 21042021
  - Good: 2021-04-21, 2024-01-13, 2025-12-03
  - Also ok: 20210421, 20240113, 20251203
  - Including time of day: 20210421T0345, 20240113T1730, 20251203T1900 for 03:45, 17:30, 19:00
- Include relevant information in the file name. However, don’t misuse a file name as a way to store all your metadata.
- Avoid overly long names (a maximum of 32 characters is suggested). Mind also the previous note about the full path length.
- Avoid moving or renaming folders or files. This is especially relevant when you or others have referred to the file by using its file name or path.
- Generate a README file explaining file nomenclature (including the meaning of acronyms or abbreviations), file organization and versioning. Store this file at the top level of the folder structure for easy accessibility.
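The tips on leading zeros and timestamps can be tried out in a few lines. This is only an illustrative sketch: the helper names `scan_name` and `timestamp_label` are invented here, and the file names and dates are the examples from the list above.

```python
from datetime import datetime


def scan_name(index: int, width: int = 2) -> str:
    """Build a file name with a zero-padded number, e.g. Scan03.csv."""
    return f"Scan{index:0{width}d}.csv"


def timestamp_label(moment: datetime, with_time: bool = False) -> str:
    """Format a timestamp 'from big to small' so that it sorts chronologically."""
    return moment.strftime("%Y%m%dT%H%M" if with_time else "%Y-%m-%d")


unpadded = sorted(f"Scan{i}.csv" for i in range(1, 13))
padded = sorted(scan_name(i) for i in range(1, 13))
print(unpadded[:4])  # ['Scan1.csv', 'Scan10.csv', 'Scan11.csv', 'Scan12.csv']
print(padded[:4])    # ['Scan01.csv', 'Scan02.csv', 'Scan03.csv', 'Scan04.csv']

moments = [datetime(2024, 1, 13, 17, 30), datetime(2021, 4, 21, 3, 45), datetime(2025, 12, 3, 19, 0)]
print(sorted(timestamp_label(m) for m in moments))
# ['2021-04-21', '2024-01-13', '2025-12-03']
```

With zero padding and year-first dates, plain alphabetical sorting (as done by a file explorer) matches the numerical or chronological order.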
There are different possibilities to indicate logical units in a name without using a whitespace:
- Kebab-case: The-quick-brown-fox-jumps-over-the-lazy-dog.txt
- CamelCase: TheQuickBrownFoxJumpsOverTheLazyDog.txt
- Snake_case: The_quick_brown_fox_jumps_over_the_lazy_dog.txt
- Not recommended: The quick brown fox jumps over the lazy dog.txt (please don’t use spaces!)
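If existing names contain spaces, the three styles above can be produced mechanically. The following sketch uses made-up helper names and simply splits a phrase at whitespace; it does not touch the file system.

```python
def kebab_case(phrase: str) -> str:
    """Join the words of a phrase with hyphens."""
    return "-".join(phrase.split())


def snake_case(phrase: str) -> str:
    """Join the words of a phrase with underscores."""
    return "_".join(phrase.split())


def camel_case(phrase: str) -> str:
    """Capitalize the first letter of each word and join without separator."""
    return "".join(word[:1].upper() + word[1:] for word in phrase.split())


phrase = "The quick brown fox jumps over the lazy dog"
print(kebab_case(phrase) + ".txt")  # The-quick-brown-fox-jumps-over-the-lazy-dog.txt
print(camel_case(phrase) + ".txt")  # TheQuickBrownFoxJumpsOverTheLazyDog.txt
print(snake_case(phrase) + ".txt")  # The_quick_brown_fox_jumps_over_the_lazy_dog.txt
```

Whichever style you choose, apply it consistently and mention it in the README file.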
Compromises often have to be made, such as including relevant information versus avoiding long names. Note that folder names with a precise and narrow meaning may become outdated when further content is filled in over time. Because of that, persistent identifiers (PIDs) typically avoid including semantic information, e.g. doi:10.17617/3.1STIJV.
- Are the names intuitive, i.e. can you get an idea of the folder/file content without looking into a README file?
- What naming convention is used in this dataset? Is it intuitive and logical? Is it explicitly described?
- In case of multiple files: Do they appear in a logical order when sorted alphabetically?
- Are there problematic characters like spaces, non-ASCII characters, etc.?
- What about the length of the names?
- Discuss: What would you leave as it is, what would you change, or what are the alternatives?
- Probably the prefix has some meaning, but we can only speculate.
- Files consist of prefix amb_hourly_qc_wc4.4_cal6.0_, followed by year and month (e.g. 2014_08), followed by _core-params.csv. Prefix might be intuitive for researchers of that field, but it is not explicitly described.
- Yes, files are sorted according to the month of measurement.
- Yes: File name “AMB hourly, readme.rtf” contains spaces and a comma. The other file names contain several dots (should only be one dot, namely before the file extension csv).
- Length is not problematic, but longer than needed.
- Replace “AMB hourly, readme.rtf” by README.rtf. The other file names can be shortened, e.g. HourlyCoreParams_2014_08.csv. And if the file name prefix amb_hourly_qc_wc4.4_cal6.0_ contains relevant information, this should be explicitly given in the metadata or README file.
- Probably yes. Anyhow, their content is mentioned in the README file.
- The dataset contains only 6 files, thus there is not really a convention available, and none is needed. The non-intuitive parts wos and fo are explained in the README file (namely “Web of Science data” and “Faculty Opinions data”). The files inside the tar.gz files seem to follow some convention, and their content is explicitly mentioned in the README file.
- Only few files, thus order is not important.
- No problematic characters found.
- Length of the names: OK
The following notes relate to the content of OSF Storage.
- Meaning not clear for all files.
- Not so clear, but not many files, thus no clear conventions needed. However, the files in folder result are lacking an explanation in the README file, and their names are not very intuitive.
- Only few files, thus order not important.
- No problematic characters found in OSF Storage.
- Length of the names: OK
- Yes, names are meaningful and intuitive.
- The subfolders of Data/Group and Data/Solo have names like 01_09_2022__10_13_33, which seem to refer to a date and maybe a time of day.
- Subfolders are not in chronological order, because the date is given in a disadvantageous format (e.g. 01_09_2022); better would be 2022_09_01 or 2022-09-01.
- Yes: Folder name “Stan model code” contains spaces.
- Length of the names: OK
- If folder name 01_09_2022__10_13_33 stands for timestamp 2022-09-01T10:13:33, then it could be renamed to 20220901T101333 or 2022-09-01_101333.
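The renaming suggested in the last note could be scripted. The sketch below rests on the same unconfirmed assumption as the note itself, namely that the folder names follow the pattern day_month_year__hour_minute_second; the function name is hypothetical, and the code only computes the new name without renaming anything.

```python
from datetime import datetime


def to_sortable_name(old_name: str) -> str:
    """Convert e.g. 01_09_2022__10_13_33 (assumed day-first) into 20220901T101333."""
    moment = datetime.strptime(old_name, "%d_%m_%Y__%H_%M_%S")
    return moment.strftime("%Y%m%dT%H%M%S")


print(to_sortable_name("01_09_2022__10_13_33"))  # 20220901T101333
```

Before renaming folders of a published dataset, recall the advice above: renaming breaks any existing references to the old paths.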
Documents may evolve over time. File versioning allows for reverting to earlier versions if needed and shall allow for keeping track of changes, including documentation on the underlying rationale and people involved.
Version control can be done either manually by using naming conventions or by using a version control system like Git. The following hints apply to manual version control, meaning that you store both the current and previous versions in your file system.
- Versions should be numbered consecutively, e.g. Handbook_v3.pdf. Major changes (v1, v2, v3, …) can be distinguished from minor ones (v1-1, v1-2, v1-3 or 1a, 1b, 1c). You may use leading zeros if you expect more than nine versions.
- Alternatively, a date or timestamp could indicate the version, e.g. Handbook_v20240725.pdf.
- You may use qualifiers such as “raw” or “processed” for data or “draft” or “internal” for documents. However, note that terms such as “final”, “final2”, “final-revised”, “final-changed_again”, and “final_ready” can be confusing. In other words: Avoid the word “final” in file names.
- Document your versioning convention, e.g. what you mean by major or minor changes.
- Document the essential changes you have made between the versions.
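With manual versioning, it helps to find the most recent version reliably, in particular when version numbers have no leading zeros. The following is a sketch for the naming scheme used in the hints above (`_v3`, `_v1-2`); the regular expression and the function name `version_key` are assumptions for illustration, not a standard.

```python
import re

# Matches "_v<major>" or "_v<major>-<minor>" directly before the file extension.
VERSION = re.compile(r"_v(\d+)(?:-(\d+))?\.[A-Za-z0-9]+$")


def version_key(name: str) -> tuple[int, int]:
    """Extract (major, minor) from a name like Handbook_v1-2.pdf; minor defaults to 0."""
    match = VERSION.search(name)
    if match is None:
        raise ValueError(f"no version found in {name!r}")
    return int(match.group(1)), int(match.group(2) or 0)


files = ["Handbook_v10.pdf", "Handbook_v2.pdf", "Handbook_v1-2.pdf", "Handbook_v1.pdf"]
print(sorted(files, key=version_key))
# ['Handbook_v1.pdf', 'Handbook_v1-2.pdf', 'Handbook_v2.pdf', 'Handbook_v10.pdf']
print(max(files, key=version_key))  # Handbook_v10.pdf
```

Sorting by the numeric key gives the intended order even though plain alphabetical sorting would place Handbook_v10.pdf before Handbook_v2.pdf.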
For further reading: GitHub recommends version names like ‘1.3.2’ for the releases of software products; for details, see Semantic Versioning 2.0.0.
File formats
A file format has to be chosen when storing information in a file. It builds the backbone of your data and is usually specified by the file extension (e.g. .txt). To keep your data interoperable, the format needs a clear structure. This makes your data easy to read with many software products (e.g., out-of-the-box solutions or by writing a small script). Clear documentation of the file format shall be publicly available. Considering all these aspects, the chance is high that the file can be read in future, making it suitable for long-term preservation - which is one of our main goals when managing data. Therefore, open file formats are recommended, while proprietary formats should be avoided.
Ideally, when choosing a suitable format, you’ll consider the following properties:
- Readable by humans with a simple editor
- Readable with many programs
- Easy to understand, low complexity
- Small (storage space)
- Quick to read (performance)
However, usually compromises have to be made. For example, binary files are generally more performant than csv files and thus more suitable during the active research process. At the same time, csv is a well-established format for long-term preservation and is easier for humans to read.
Often, proprietary formats have intentionally no proper documentation as the company behind the system wants to keep their business information behind closed doors. The companies sometimes even use technical protection mechanisms, making the file format readable only by commercial software. This reduces the interoperability and reusability of the files and, in the worst case, makes them unreadable in the long term. (Imagine the company that provided the software and file format no longer exists.) Furthermore, the files might contain hidden (potentially sensitive) information. Thus, such formats should be avoided.
In the following list, you’ll find some formats which are widely used, well-documented and readable with several programs.
- For documentation:
  - Plain text (.txt)
  - HTML, XHTML, Markdown
  - PDF (PDF/A-1)
  - maybe: Rich Text Format (.rtf), Open Document Text (.odt), docx, …
- Tabular data:
  - Comma-separated values (.csv)
  - Tab-delimited (.tab)
  - maybe: Open Document Spreadsheet (.ods), xlsx, …
- Nested data:
  - JSON
  - XML
- Further formats:
  - NetCDF, HDF5, …
  - png, jpg, …
Notes:
- PDF: PDF has been developed by Adobe Inc. and thus originally had been a proprietary format, and several versions exist. Nevertheless, the format is widely used today. For archival purposes, a PDF/A version is the best choice. PDF is best suited for fixed documentation. However, editing PDF files or extracting data from them takes a lot of work.
- Spreadsheet files: Spreadsheets may look nice, particularly when formatted in a colourful way. But for the machine-readability, this can cause problems. In particular, we do not recommend that you present relevant information just by formatting content differently. You can take this as a rule of thumb: Spreadsheet files like .xlsx or .ods are not well machine-readable.
- Are the files stored in an open or a proprietary format?
- Is the file format used “future-proof”, e.g., suitable for long-term archiving?
- How easy is it to open the file (regarding available programs and file size)?
- How complex are the files? What is their internal structure?
- What about performance and file size?
- How easy is it to understand the file structure as humans?
- Are they machine-readable and standardized? How easy is it to write a script to read the files?
- Which alternative formats exist?
- Files are ASCII files, thus open.
- Yes, ASCII is suitable for long-term archiving.
- Easy to open, e.g. with text editor.
- Files have tabular shape.
- OK, file sizes are below 1 MB.
- Shape: easy to understand, meaning of the columns given in README file.
- Most data analysis programs have import functions for csv. The quotes in the first column might be cumbersome for some import routines.
- Tab-separated files, spreadsheet files, etc.
- The small files are ASCII or UTF-8 files, thus open. The tar.gz files are compressed TAR-files, thus also in an open format.
- Yes, ASCII is definitively suitable for long-term archiving. Also tar.gz files are widely used and can thus be considered suitable for long-term archiving.
- The tar.gz files need specific software for extraction, which is freely available, but maybe not installed everywhere, and not everyone is familiar with it. Thus it is commendable that the extraction is described in the README file. However, the file size of several GB can be problematic for users having a slow internet connection. And unpacked, the largest file is more than 26 GB, more than the RAM size of many computers.
- The data files (inside the tar.gz) are not complex, just tables.
- Due to compression, the file size is reduced for storage and download. However, the tables contain many digits, probably more than needed. Reducing them would decrease file size. Binary files instead of ASCII files would need less time for loading.
- Shape: easy to understand, meaning of column see README file.
- Most data analysis programs have import functions for csv.
- Binary files like HDF, which could enhance performance.
Following notes relate to the content of “OSF Storage”.
- Most files are in an open format: ASCII tables, JSON files, R scripts. But what are “nii.gz” files in folder “results” - maybe zipped NIfTI files?
- Yes for ASCII tables and JSON files; maybe yes for nii.gz files.
- ASCII tables and JSON files: easy to open with every text editor, special software or libraries needed for nii.gz.
- Files in folder data are tables (csv) or Codebooks (in JSON format) describing those.
- OK, because the files are not very large.
- ASCII tables and JSON files are easy to understand by humans; nii.gz needs suitable software.
- Most data analysis programs have import functions for csv, also JSON import functions are available for several programs.
- For csv-tables: Tab-separated files, spreadsheet files, etc; for JSON: XML
- Files are stored as ASCII tables or plain text files, which are open formats.
- Yes, suitable for long-term archiving.
- Easy, readable with text editor.
- Data files are ASCII tables.
- Due to compression, the file size is reduced for storage and download. Binary files instead of ASCII files would need less time for loading.
- The format is easy to understand by humans, but the columns are not explicitly described.
- Most data analysis programs have import functions for semicolon-separated tables.
- Binary files like HDF could be used (cf note above related to performance).
A gold standard for storing digital information is an ASCII file. In an ASCII file, each byte represents one visible character (except for the white spaces and control characters like tab stop and linebreaks).
Therefore, ASCII files can be read or opened by any text editor or data-processing software, even with programs like Excel, Word, Wordpad or web browsers (only possibly limited regarding the file size).
Characters beyond ASCII:
An ASCII file can only contain the following visible characters: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Otherwise, it is not an ASCII file.
For some years, the Unicode-based file format “UTF-8” has been available, which can represent many characters beyond the ASCII characters, like “ü”, “€”, and even some smilies ☺. Nowadays, UTF-8 is supported by many editors and browsers. The good thing about UTF-8 is that as long as a UTF-8 file contains only ASCII characters, the UTF-8 file is automatically an ASCII file. In other words, an ASCII file is a super-interoperable UTF-8 file.
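The relation between ASCII and UTF-8 can be verified directly. The following small sketch uses only the Python standard library; the function name `non_ascii_characters` is made up for this example.

```python
def non_ascii_characters(text: str) -> list[str]:
    """Return the characters of a text that are not part of ASCII."""
    return [char for char in text if not char.isascii()]


plain = "Temperature_RawData"
# For pure ASCII text, the ASCII and UTF-8 encodings are byte-for-byte identical.
print(plain.encode("ascii") == plain.encode("utf-8"))  # True

mixed = "Größe in €"
print(mixed.encode("utf-8").isascii())  # False
print(non_ascii_characters(mixed))      # ['ö', 'ß', '€']
# Characters beyond ASCII need more than one byte in UTF-8.
print(len("ü".encode("utf-8")), len("€".encode("utf-8")))  # 2 3
```

To test a file, read it in binary mode and call `isascii()` on the resulting bytes.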
Tabular text files (optional)
Please note that the task in this section is optional. You can go through this section if you still have some time left during the workshop or read it afterwards.
Tabular text files store data in a structured format, where each row represents a record and each column represents a field, with data separated by a designated column separator. Even after deciding to store tabular data in text files (e.g. files which can be opened in any editor), there are various ways and conventions to choose from:
- Column separator: typically tab or comma, sometimes space or semicolon
- Numeric values: handling of missing values (e.g. “NA”, “”, etc.)
- Representation of timestamps, e.g. “2024-08-01T08:59”
- Header lines with meta information?
- Encoding: Recommended is ASCII or UTF-8
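When such conventions are documented, reading the file with a short script becomes straightforward. The sketch below is only an illustration: the sample content, the column names `time` and `temperature`, and the function name `read_table` are invented, and the markers “NA” and the empty string are assumed to denote missing values.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample: comma-separated, ISO 8601 timestamps, "NA" for missing values.
SAMPLE = "time,temperature\n2024-08-01T08:59,21.5\n2024-08-01T09:59,NA\n"


def read_table(text: str, delimiter: str = ",", missing: tuple = ("NA", "")) -> list[dict]:
    """Parse a tabular text, converting timestamps and mapping missing values to None."""
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        value = record["temperature"]
        rows.append({
            "time": datetime.fromisoformat(record["time"]),
            "temperature": None if value in missing else float(value),
        })
    return rows


for row in read_table(SAMPLE):
    print(row["time"].isoformat(), row["temperature"])
```

For a real file, open it with an explicit encoding, e.g. `open(path, encoding="utf-8", newline="")`, and pass the file content or adapt the function to read from the file object.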
- How is the file encoded (e.g. ASCII, UTF-8)?
- Numbers: What about their precision (enough or too much)?
- Special numbers: Do special numbers like “NA”, “”, “N/A”, “999”, “0” occur? Is their meaning documented?
- Time: Which format is used for the date and time of day? Which time zone is used?
- Tables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?
- Is the content of the table self-explaining (column description, which units used), or is it explained elsewhere (README file or Codebook)?
- ASCII
- Has many digits, e.g. 986.223944276841.
- No information about missing values found in README file. But file amb_hourly_qc_wc4.4_cal6.0_2017_03_core-params.csv contains NA.
- Time conforms to ISO 8601, except that a space is given between date and time of day, e.g. 2017-03-23 09:30:00. The README file mentions: “All times given in GMT”.
- Comma as column separator, whitespace only between date and time of day, no missing columns found.
- Not self-explaining but mentioned in README file, also the units.
- UTF-8, except for the tar.gz. The files inside those tar.gz are even ASCII files.
- Probably more digits than needed, e.g. -13.333333333333336. Considering the file size, shortening them could be worthwhile.
- No information about missing values found in README file. But NA found in several files.
- Time: There seems to be no time column.
- Tables: Comma as column separator, no missing columns or whitespaces found.
- Not self-explaining but columns mentioned in README file.
The following notes relate to the content of OSF Storage.
- Encoding: ASCII files (except for the nii.gz files)
- Numbers: e.g. 0.878519 - looks reasonable
- No information about missing values found in README file. But NA found in several files.
- Time: There seems to be no time column.
- Tables: Comma as column separator, no missing columns or whitespaces found.
- Content of the table is explained in JSON file (Codebook).
- Data tables are ASCII files.
- Numbers: e.g. 73.12958 - looks reasonable
- Special numbers or contents: Some columns contain parentheses - what is their meaning?
- Time: Time column with seconds(?) since start time?
- Tables: Semicolon as column separator, also semicolon after last column.
- The README file says “Variable names should be quite descriptive, but please get in touch in case anything is unclear”, but not all columns are so clear to understand.
References
Examples and notes have been adapted from: Onboarding into Research Data Management, Franke et al. 2024, https://hdl.handle.net/21.11116/0000-000E-194D-1, file “FDM-Onboarding-2024_CPT-Slides.pdf” pages 44-51, 56-59.
Footnotes
adapted from https://datadryad.org/stash/best_practices#organize
adapted from Suse Prejawa (2021, https://hdl.handle.net/21.11116/0000-0008-662A-7)
adapted from a template used at the Max Planck Institute for Chemistry for measurement projects/campaigns (e.g. with a research aircraft)