model: Add a new field for authors#3627
Conversation
"contributors" |
|
And please rebase @bs-ondem to resolve a conflict. |
| * TODO: Remove default value after upgrading all package manager implementations. | ||
| */ | ||
| @JsonInclude(JsonInclude.Include.NON_NULL) | ||
| val declaredContributors: SortedSet<String>? = null, |
There was a problem hiding this comment.
We try to avoid nullable properties in the model, in this case an empty set is just as good as null and makes the property easier to use.
Instead of adding a default value here, you could already pass an empty set in all package managers and add the TODOs there, this would make it easier to track for which package managers this still needs to be implemented.
Would it make sense to capture this for projects as well? At least for some package managers like Maven we get the same metadata for projects as for packages. It might not be needed for copyright curations, but it would be more consistent.
There was a problem hiding this comment.
I added the declaredContributors field also in the Project class and my intention is to track also the contributors for the projects as well, even if this is not required to solve the issue #3400.
063309a to
9f62ce8
Compare
There was a problem hiding this comment.
@sschuberth @bs-ondem We collect this information to do notices generation to show who owns the code and under which license they distributed it. Contributors are often not from a legal point of view authors/code owner as they signed a copyright assignment e.g. this is why most ORT code is 'HERE Europe B.V.' and/or 'Bosch.IO GmbH'.
Author is a term used in copyright law see the copyright law of the united states, June 2020 - it contains 622 matches on 'author' and 0 matches on 'contributor' which is exactly why most package managers I mention in my #3400 comment use author as a field name
Propose to rename declaredContributors to declaredAuthors if your intent is to store the under copyright law recognized authors/owner of the code for use in notice generation.
Furthermore I would update the comments referring to 'authors/developers/contributors' to just mention authors. Why? Some package manager such as npm has both author and contributors fields see 1, also Maven has developers/contributors see 2. My understanding of this field is that we only want to capture code owners and do not plan to capture any of the other people fields as part of #3400.
@tsteenbe, would it then be fine in your view to populate And "contributors" we should generally ignore then, correct? |
9f62ce8 to
92ca1ec
Compare
|
@sschuberth In case of Maven: Do you mean the organization of the project or the organization of each developer, or both? Because in Maven it's possible to specify the organization of both. @tsteenbe Thanks for the clarification. I will rename it to |
Good question. I was referring to |
92ca1ec to
c90974e
Compare
|
@sschuberth to your above questions I propose we do the following:
and then run a de-duplicate to ensure we only have unique authors. |
c90974e to
dc35ac1
Compare
| * The list of authors declared for this package. | ||
| */ | ||
| @JsonInclude(JsonInclude.Include.NON_DEFAULT) | ||
| val declaredAuthors: SortedSet<String> = sortedSetOf(), |
There was a problem hiding this comment.
I believe what @mnonnenmacher meant when he said "Instead of adding a default value here, you could already pass an empty set in all package managers" was that you should not only not assign null here, but no default value at all. That way you're forced to initialize the field in all package managers, which IMO was the intent. And that also implies that you should squash the two commits.
Also, as for licenses, we IMO should always serialize the field, even if it's empty.
Finally, I'm a bit unsure about the field ordering. My hunch is that we should keep all license-related fields grouped together, and add the new declaredAuthors between concludedLicense and description. What do you think, @mnonnenmacher?
There was a problem hiding this comment.
I believe what @mnonnenmacher meant when he said "Instead of adding a default value here, you could already pass an empty set in all package managers" was that you should not only not assign
nullhere, but no default value at all. That way you're forced to initialize the field in all package managers, which IMO was the intent.
Introducing this new field without a default value is a rather breaking change because then existing ORT results can no longer be deserialized. This would also impact the data stored in scan result storages. This may be acceptable, but is something we should be aware of.
There was a problem hiding this comment.
My bad, I didn't think of deserializing old results. @mnonnenmacher I guess we want / need to maintain backward-compatibility here? But we should still drop @JsonInclude(JsonInclude.Include.NON_DEFAULT), right?
There was a problem hiding this comment.
The annotation is not necessary. We put it here for practical reasons because otherwise all tests for analyzers would fail that compare current results against stored yml files. Maybe we could keep the annotation until we have reworked the analyzer implementations to populate the authors field? In the course of this change the test files would have to be adapted anyway.
There was a problem hiding this comment.
Finally, I'm a bit unsure about the field ordering.
I grouped declaredLicenses and declaredAuthors in order to have to the "declared" things together.
But we should still drop @JsonInclude(JsonInclude.Include.NON_DEFAULT), right?
As @oheger-bosch already mentioned, if we drop the annotation here then we have to adapted all analyzer tests in this PR and again after we have reworked the analyzers. Can we keep the annotation here with a TODO comment e.g. "Can be removed after all package manager implementations have filled the field declaredAuthors accordingly."?
There was a problem hiding this comment.
Here's one more thought: Does it even make sense to call this declaredAuthors instead of just authors after all? While our original idea of naming the field declaredCopyrights made sense to distinguish from detected copyrights, we already figured that this field does not contain full copyright statements in the "detected copyright" sense as it lacks the actual "Copyright (C)" prefix and the year.
So why not treat this as just another metadata field, like e.g. homepageUrl, name it just authors, put it before description, and let later logic consider "authors" as "declared copyright holders" (similar to like we also consider the "homepageUrl" later as a fallback for a VCS URL)?
There was a problem hiding this comment.
The original plan was to treat author information analogously to declared licenses and allow an analogous curation mechanism. In this context, the name declaredAuthors would make sense.
If we treat authors as regular metadata, would we then still want to curate them? Or would the curation rather apply to the logic that transforms authors to copyright holders? If so, would the curations still be defined in curations.yml or would this be a different mechanism?
There was a problem hiding this comment.
Please have a look at the PackageCurationData class: We can curate any metadata, no matter whether it start with a "declared" prefix or not. So my point is purely about how to name the field, the original plan and all processing of the field's data stays the same.
My (somewhat rhetorical) question boils down to: If we had a declaredAuthors field, what does "declared" mean here, and how does it semantically differ from just authors, as any metadata in the Package / Project classes is just "declared" (and not detected) metadata. Phrased differently, why is homepageUrl not called declaredHomepageUrl. Well, because "declared" is redundant in this (the analyzer's) context.
Going further off-topic: Consequently, I actually believe we should have named the declaredLicenses fields also just licenses here, and only further tooling that also has to deal with other kinds of licenses should refer to it as declaredLicenses.
There was a problem hiding this comment.
But why is declaredAuthors then between declaredLicenses and declaredLicensesProcessed instead of after declaredLicensesProcessed?
Because for me declaredLicensesProcessed is not a field for gathering metadata directly from the package manager but used a processor in order to map the declaredLicenses to SpdxExpression.
Phrased differently, why is homepageUrl not called declaredHomepageUrl. Well, because "declared" is redundant in this (the analyzer's) context.
From this point of view it makes sense to name it "authors" but as you mentioned we have to rename consequently also "declaredLicenses" to "licenses".
But we have to come to a conclusion and therefor I'll try to summarize the suggestions and we should decide on one:
- Field name
declaredAuthorsand put it beforedeclaredLicenses. - Field name
declaredAuthorsand put it beforedescription. - Field name
authorsand put it (?) beforedeclaredLicensesordescriptions. In addition, renamedeclaredLicensestolicenses.
There was a problem hiding this comment.
I've put this onto the agenda for the next dev meeting to move forward.
dc35ac1 to
ad65845
Compare
This is a preparation for solving oss-review-toolkit#3400 and making copyright holder curations possible. A new field authors is added to Package, PackageCuration and Project in order to be able to track fields like authors from various package managers. These changes don't include the adaptation of the package managers. Signed-off-by: Onur Demirci <onur.demirci@bosch.io>
To keep track of which package manager needs to be expanded with authors, initialize all Project/Package objects with an empty set for authors and add a TODO comment above. The package managers can be expanded step by step later. Signed-off-by: Onur Demirci <onur.demirci@bosch.io>
ad65845 to
4f8c184
Compare
Comments addressed.
This is a preparation for solving #3400 and making copyright holder
curations possible. A new field declaredAuthors is added to
Package, PackageCuration and Project in order to be able to track
fields like authors from various package managers. These changes
don't include the adaptation of the package managers.