Workshop on Research Objects 2019

Peer Review of RO-3

Review 1

Quality of Writing

Is the text easy to follow? Are core concepts defined or referenced? Is it clear what the author’s contribution is?

There are some grammar and spelling issues, but the paper is readable overall.

Research Object / Zenodo

URL for a Research Object or Zenodo record provided?
Guidelines followed?
Open format (e.g. HTML)?
Sufficient metadata, e.g. links to software?
Some form of Data Package provided?

Add text below if you need to clarify your score.

Paper at Zenodo, notebook provided on GitHub, but the raw data must be retrieved and loaded into triple stores to re-run the analysis presented.
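For illustration, once the raw data has been loaded into a local triple store, per-graph triple counts of the kind analysed in the paper could be gathered with a query along these lines. This is only a minimal sketch, assuming a Virtuoso SPARQL endpoint at its default localhost port and the SPARQLWrapper Python library; it is not the authors’ actual code.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed local Virtuoso endpoint; adjust to wherever the nanopublications were loaded.
sparql = SPARQLWrapper("http://localhost:8890/sparql")
sparql.setReturnFormat(JSON)

# Count triples per named graph; in a nanopublication store each head,
# assertion, provenance and publication-info graph is a named graph.
sparql.setQuery("""
    SELECT ?g (COUNT(*) AS ?triples)
    WHERE { GRAPH ?g { ?s ?p ?o } }
    GROUP BY ?g
    ORDER BY DESC(?triples)
    LIMIT 10
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["g"]["value"], row["triples"]["value"])
```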

Overall evaluation

Please provide a brief review, including a justification for your scores. Both score and review text are required.

The paper presents a statistical analysis of the number of triples used in nanopublications from several major sources. It also inspects the contents of some of those triples, such as the author-related provenance triples.

The statistical analysis provides an overview of the size of individual nanopublications and the distribution of triples between the 4 sub-graphs that compose them. However, there is little discussion of why these statistics are useful. The variability in triple count is primarily due to variability in the underlying data types being represented (and the analysis presented can’t discern between this and variability in triple representation for different instances of the same data type). The authors note that one motivation for the work was to reproduce an earlier analysis (hence the same versions of the datasets were used).

The paper doesn’t mention whether the results were indeed reproduced (presumably they were, since it involves straightforward counts). Further, that earlier analysis was done primarily to show the overhead involved in nanopublications (triples not related to assertions, repetitive triples that are the same for large numbers of publications from the same source, etc.) as background for a discussion of compression methods.

The current paper, which adds an analysis of variability, does not connect with that discussion of overheads, nor does it motivate the additional variability analysis with any tie to the data quality issues highlighted in the introduction. (While the hypothesis that ‘since there is little variation in the number of triples in the publication graph, there are issues with the data quality’ is stated, it is not supported with any argument or analysis.) Instead, the assessment of data quality appears to come from the inspection of the contents of specific triples, where it appears some sources confuse authorship (who is claiming the assertion) with creatorship (who/what software created the nanopublication itself), a confusion that does not affect the triple count or its variability.

Beyond this claim, the paper argues that WikiPathways’ use of linked data references to author information, to reduce nanopublication triple counts, is not good practice. Isn’t this the intent of linked data? It is certainly a trade-off between making the nanopublication complete/stand-alone and reducing the number of triples that have to be retrieved, but what proof is there that this trade-off is bad (impedes science) rather than useful? The paper also implies that some data sources may use fewer triples because they do not collect/contain all the information that other sources do, without providing evidence that this is indeed the case (‘possibly because it is not captured in the underlying database’ is all that is in the paper).

Overall, while the paper does a reasonable job of presenting what work was done, it does not make a clear case for the importance of the statistical analysis, drawing conclusions primarily from the non-statistical inspection of triple values. With the emphasis in the main text on the statistical portion, discussion of the qualitative analysis is limited and the only well-supported conclusion is that some data sources are misusing the author concept. The claims in the introduction that analysis of triple distribution in nanopublications is useful in assessing data quality and that community guidelines are needed are not well supported.

Review 2

Quality of Writing

Is the text easy to follow? Are core concepts defined or referenced? Is it clear what the author’s contribution is?

Clear contribution and concepts well defined. See detailed review below.

Research Object / Zenodo

URL for a Research Object or Zenodo record provided?
Guidelines followed?
Open format (e.g. HTML)?
Sufficient metadata, e.g. links to software?
Some form of Data Package provided?

Add text below if you need to clarify your score.

A GitHub repository and notebook are provided, but I cannot see that option here.

Overall evaluation

Please provide a brief review, including a justification for your scores. Both score and review text are required.

This paper provides a thorough statistical description and analysis of the large corpus of nanopublications that is currently available. The main aim of the authors is to show that there is a clear need for agreed conventions and good practices on how to generate and publish nanopublications, beyond the purely technical approach, which is well understood nowadays. That is, there is a need to agree on how to express the aspects that relate mostly to the publication and provenance graphs.

The research done towards obtaining these conclusions is relevant, and the results obtained are reproducible (I have not been able to reproduce them completely from the provided notebook, since, given the limited amount of time, I could not load all the nanopublications into a Virtuoso store on my computer; but from inspecting the code I can confirm that I would be able to run it should I have the data available).

A few comments that would make the paper even better, IMO:

Minor comment: I would recommend that the authors go over the text of the paper with care, since there are many misspellings and grammar issues throughout, which would need to be addressed to make the paper easier to read. Anyway, this has not been taken into account in my evaluation, as this is a workshop paper.

Editorial discussion

(Omitted: Confidential discussion with author, RO2019 chairs and Reviewer 1)

Meta-review: Stian Soiland-Reyes Fri 26 Jul 2019

(authors query late review, considering the approaching early-bird deadline)

We have the reviews of your article, which I’m afraid were mixed, so we have decided it is Accepted for oral presentation and will be listed as-is in the RO2019 proceedings, but will not be accepted as a paper for the IEEE eScience proceedings.

You are of course free to update the Zenodo record until the conference day in case you want to address the review comments.

Comment: Stian Soiland-Reyes Fri 26 Jul 2019

I see now the camera-ready deadline has moved a week to Mon 6 Aug - do you think this could be sufficient to correct a majority of the issues from the review..? (..)

Comment: Alasdair J G Gray Fri 26 Jul 2019

We could certainly look at the argumentation in the paper to address some of the first reviewer’s comments, e.g. linking the discussion back to the overhead analysis, and amending the argument around the Linked Data approach.

We would also look at where we could incorporate the suggestions from Oscar, and of course fix the spellings and grammar (..).

Whether you feel this is enough is up to you.

Email to Reviewer 1 Sat 27 Jul 2019

Thanks for your RO2019 reviews, which I thought were thorough and helpful.

We are now sending out the acceptance/rejection letters.

We have one question about the below submission #3.

We told the authors that given the reviews we could only accept #3 for an oral talk (as if it was an abstract), but leave it out of the IEEE eScience proceedings.

Then it came to our notice that the IEEE camera-ready deadline has been moved a week to Mon 5 August. The author contacted us to suggest they could improve the manuscript by then:

We could certainly look at the argumentation in the paper to address some of the first reviewer’s comments, e.g. linking the discussion back to the overhead analysis, and amending the argument around the Linked Data approach.

We would also look at where we could incorporate the suggestions from reviewer 2, and of course fix the spellings and grammar. (..)

Whether you feel this is enough is up to you.

What do you think? Would this be sufficient to change your decision - conditional on these changes in an updated submission?

If you want (and have time) you could do a quick re-review near the end of next week, or the chairs will check and add a metareview.

Full disclosure: The co-author Alasdair J G Gray used to work in our lab at the University of Manchester, and collaborates with Carole Goble in ELIXIR and BioSchemas.

(Reviewer 1 responds that the main concern is the implied link from the statistical analyses to the (valid) quality issues. The reviewer does not object to the paper being included after updates if the chairs think the updates are reasonable)

Comment: Stian Soiland-Reyes Tue 30 Jul 2019

(..) I have checked with the other chairs and Reviewer 1, and we think if you can try to address the review comments in a revised paper by this Friday [2 Aug 2019] (just update in Zenodo and notify ro2019@easychair.org), then it can go into the eScience proceedings after the RO2019 chairs verify the updates.

Reviewer 1 highlighted:

“it looks to me like the statistical analysis has almost nothing to do with the (valid) quality issues that were found by inspecting the values of the triples. (..)”

Thus it is primarily the assertion that the statistical analysis helped find quality issues that should be revisited in your updates.

Comment: Alasdair J G Gray Tue 30 Jul 2019

Thanks for your follow up email. We will make the changes outlined below [see previous comments, eds.] so that the paper can be accepted for publication.

(..)

(Revised paper https://doi.org/10.5281/zenodo.3358983 was received Fri 2 Aug 2019)

Re-review 3

Review of revised submission https://doi.org/10.5281/zenodo.3358983

Quality of Writing

Research Object / Zenodo

[x] URL for a Research Object or Zenodo record provided?
[x] Guidelines followed?
[ ] Open format (e.g. HTML)?
[-] Sufficient metadata, e.g. links to software?
[ ] Some form of Data Package provided?

Add text below if you need to clarify your score.

The paper includes links to the source code in GitHub https://github.com/ImranAsif48/RO2019. The repository contains the aggregated results as a series of CSV files that have not been further described or related; however, there is a Jupyter Notebook that shows how they were produced.

The notebook is launchable with myBinder, using the pre-extracted CSV datasets instead of the original localhost SPARQL endpoint. It is unclear how their SPARQL endpoint was populated. However, the myBinder instance does not work, as dependencies like seaborn are not installed; see https://mybinder.readthedocs.io/en/latest/using.html#preparing-a-repository-for-binder for how to add the Jupyter dependencies. Several later calls also fail because the expected files are not found in the repository, e.g. [Errno 2] No such file or directory: 'distributionDisGeNET.csv'

I would recommend the authors try to update the Jupyter notebook to work outside the author’s environment, ideally from the GitHub repo in myBinder.
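One common way to address the dependency problem would be a requirements.txt at the repository root, which myBinder picks up when building the environment. A minimal sketch, assuming the notebook only needs pandas, matplotlib and seaborn (the exact dependency list is for the authors to confirm):

```
# requirements.txt at the repository root, used by myBinder to build the environment
pandas
matplotlib
seaborn
```

The file-not-found errors would similarly need paths relative to the repository layout rather than the author’s local directories.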

There is no DOI/Zenodo archive of the repository; see https://guides.github.com/activities/citable-code/

The paper has appropriate data and code citations as both URLs and regular citations. These are, however, not reflected in the Zenodo metadata, which only contains the PDF, abstract and author information. As a minimum I would expect links to the GitHub repo (or its Zenodo release) as Supplementary Material.

As the topic of this paper is nanopublications, perhaps I would have expected new nanopublication(s) to contain such links as linked data.

Overall evaluation

Please provide a brief review, including a justification for your scores. Both score and review text are required.

The authors have addressed both reviews, improved the language, expanded the discussion and significantly strengthened the Conclusion section.

This paper takes a critical view of how nanopublications have been used in practice, in particular to compare their provenance coverage. No issues seem to be raised about the assertion graphs (their triple characteristics are more variable, as one would expect); only the provenance and publication graphs are considered in detail. Presumably this is because the assertion triples are much more domain-specific and tend to use more complex patterns that combine multiple OBO ontologies.

The authors highlight that more detailed guidelines are required for the creation of high-quality nanopublications, but do not provide a first attempt at what such guidelines should contain.

I believe the paper is no longer making the implication that the qualitative assessments directly follow from the quantitative analyses (as raised by Reviewer 1); the text now better describes how the initial triple statistics led the authors to the deeper analyses that raise several quality issues in the nanopublications’ provenance.

I agree with the authors that it is misleading for nanopublications to use pav:authoredBy to indicate the authors of the software that created the nanopublication; as we argue in the PAV paper, the Authors are the primary originators of the scientific statements, originally conceiving the content.

Side-note: This raises the question of what the knowledge content of a nanopublication is. Arguably it is both the structural assertion and the surrounding metadata about the nanopub itself, but PAV considers the author to be the one who made the assertion in whatever form (it might have been orally), the curator the one structuring this information, and the creator the one putting the structure into a digital or physical form. Nanopubs allow for this distinction by separating who authored the assertion graph vs who created the nanopub structure (using different subjects), but this paper highlights that in practice this provenance is not consistently clear.
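To make that distinction concrete, here is a minimal rdflib sketch with hypothetical URIs (not taken from the paper or from any specific nanopublication source): pav:authoredBy in the provenance graph points to the person behind the assertion, while pav:createdBy in the publication-info graph points to the software that generated the nanopublication structure.

```python
from rdflib import Graph, Namespace, URIRef

PAV = Namespace("http://purl.org/pav/")

# Hypothetical identifiers; real nanopublications mint their own URIs.
nanopub   = URIRef("http://example.org/nanopub/1")
assertion = URIRef("http://example.org/nanopub/1#assertion")
scientist = URIRef("https://orcid.org/0000-0000-0000-0000")  # placeholder ORCID
exporter  = URIRef("http://example.org/software/exporter-v1")

provenance = Graph()
# Who originally made the scientific claim captured in the assertion graph.
provenance.add((assertion, PAV.authoredBy, scientist))

pubinfo = Graph()
# What produced the nanopublication structure itself.
pubinfo.add((nanopub, PAV.createdBy, exporter))

print(provenance.serialize(format="turtle"))
print(pubinfo.serialize(format="turtle"))
```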

Overall I think this revised paper is well-written and highlights important metadata provenance challenges, which should be of importance not just to the nanopublication community, but to Research Objects in general (in particular for automatically created ROs).