Evaluating the RAIL license family
Machine learning (ML) is the hot topic in tech circles right now, and tech lawyers are no exception. Virtually every lawyer discussion I’ve had in the last two weeks has ended with a variation on the same question: are these new AI licenses actually open? So let’s jump in.
tl;dr: it’s not open as defined by the Open Source Initiative, but it may still be the most important license of the next three to five years. That is all the more reason to take it seriously, work to build bridges, and find ways to improve it.
Lawyer reviewing a document, generated by Stable Diffusion—one of the ML projects licensed under a RAIL license.
If you’re not up-to-the-minute on the latest machine learning trends, here’s some important background. The new hotnesses are Stable Diffusion and BLOOM—machine-learning tools for generating images and text.
Like all machine learning tools, Stable Diffusion and BLOOM are combinations of a “model” (the actual machine-learning component) and the wrappers around the models that are used to execute them. These wrappers are often based on standard frameworks like PyTorch, but may be custom, or contain substantial code to parse prompts and model outputs.
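To make the model/wrapper distinction concrete, here is a toy Python sketch. (This is not real Stable Diffusion or PyTorch code; all names here are hypothetical.) The "model" is just trained weights, while the wrapper is ordinary source code that tokenizes a prompt, invokes the model, and formats the output:

```python
# Toy sketch of the split described above: the "model" is the trained
# weights, and the "wrapper" is ordinary source code around it. Under the
# RAIL family, the -M license covers the former, the -S license the latter.

class ToyModel:
    """Stand-in for trained weights: just a bag of numbers."""
    def __init__(self, weights):
        self.weights = weights  # in practice, millions of floats

    def forward(self, tokens):
        # Real inference multiplies tensors; here we just fake a score.
        return sum(self.weights[t % len(self.weights)] for t in tokens)

def run_pipeline(model, prompt):
    """The 'wrapper': parse the prompt, call the model, format output."""
    tokens = [ord(c) for c in prompt.lower()]  # trivial "tokenizer"
    score = model.forward(tokens)
    return f"output(score={score:.1f})"

model = ToyModel(weights=[0.1, 0.2, 0.3])
print(run_pipeline(model, "a lawyer reviewing a document"))
```

The licensing point follows from the split: the wrapper code is clearly copyrightable software, while the weights inside `ToyModel` are a different kind of artifact entirely, which is why the RAIL family splits into -S and -M variants.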
What makes Stable Diffusion and BLOOM interesting to us, besides their creative potential, is that both are licensed to the public under a family of new artificial intelligence (AI) licenses from the Responsible AI Licenses (RAIL) initiative. (The specific versions can be a bit tricky to follow, so I will simply refer to the RAIL family rather than to specific versions. A deep dive on the various types and history can be found on the RAIL site.)
Ethical and regulatory background
Surveys show AI/ML practitioners are highly attuned to ethical concerns in the application of their code, and they have developed a wide range of techniques to address these concerns, like model cards and bias toolkits. (The diagram above, showing a more complete view into the ecosystem, is from the introduction to the BigScience RAIL license).
Until recently, one primary mechanism for dealing with these concerns was to make models available only to trusted partners, as in OpenAI’s partnership with Microsoft. This secrecy-based approach worked in large part because training an AI model was too expensive for any open communities to do on their own.
However, for a variety of reasons, the willingness of various parties (including established players like Facebook and Amazon, and new startups like HuggingFace and stability.ai) to fund public or semi-public training has gone up. As a result, more models have become publicly available. This means secrecy is no longer enough—ML will require a new approach to control and governance.
As part of this, the ML community has turned to an old tool—copyright licensing. The RAIL-M group of licenses (they are similar, but not identical) is one of a number of new licenses that attempt to enforce AI-specific ethical obligations through copyright licensing. This post will analyze RAIL-M to help readers understand both the specifics of this license, as well as the more general pitfalls and challenges that face any attempt to regulate ethics through copyright licensing.
(Does this matter?)
The rest of this post is very lawyerly: analyzing a text, spotting strengths and weaknesses. But a non-lawyerly way to ask “is this open?” is to say “is a community of real people, doing real collaborative work, coalescing around this thing?” Stable Diffusion and BLOOM are already generating healthy communities. For example, hackers are improving Stable Diffusion performance (both speed and RAM usage), and other companies are using it for their core products.
“Is there a community?” is not the only test of a license, of course. We don’t have to look any further back than crypto to remind ourselves that vibrant communities can be built on dangerously flawed legal premises. That said, healthy communities are an important reality check that open license analysts should take seriously.
Are the RAIL licenses likely to see wide adoption?
I wrote in 2021 about what qualities a successful license should have, and many of the questions asked in that post are extremely relevant to the question of ethical AI/ML licenses. Note that these criteria rarely speak to the quality of the license—they’re external factors that might make the license succeed (or fail) almost regardless of how well it is written.
- The “unavoidable” application: Unlike virtually all other attempts at public software licenses that enforce ethical restrictions, RAIL licenses are used by potentially unavoidable applications like Stable Diffusion and BLOOM. Just as Linux and MySQL forced lawyers to come to terms with the GPL, Stable Diffusion and BLOOM are likely to force a lot of lawyers to learn at least the basic contours of the RAIL group of licenses.
- Documentation and education: The RAIL initiative appears to be working on this challenge, with a documentation website and appearances by its authors on educational panels. These are positive signs, but this sort of education is a long-term commitment, and it is still too early to know how this will play out. This is particularly true because, like open source in the early 2000s, there is already a flurry of licenses in this space, and distinguishing between them will be important for developers and lawyers alike. (Note also that this needs to be a two-way commitment by drafters and those who need to be educated. To that end, the ‘traditional’ open legal community is starting to do outreach to the RAIL community to help both sides learn.)
- Vision and evangelism: In the traditional open source community, leadership and funders are often opposed to licensing that limits usage. In contrast, the ethical restrictions in AI licensing are being requested and driven by practitioners, who then evangelize for the licenses. So there is a very ripe ground for broad-based evangelism for something like this license, even if not this license specifically.
- Partnerships: The RAIL initiative’s partnership with HuggingFace (a model and inference service that is the hub of much machine learning activity) will expose a lot of projects to the license through their license picker. This is usually a very hard hurdle for new licenses to get over.
- Governance: These licenses will likely have to be revised to take into account both user experience and fast-moving changes in both technology and external regulatory frameworks. This makes governance important, in contrast with older open licenses that may not have changed in decades. That said, the project is also young—I’m sure they’re aware of this challenge and working on it.
In reviewing this checklist of success factors, virtually all of the signs point towards a license that could be very widely adopted—regardless of the quality of the license drafting itself.
The licenses themselves
Because a line-by-line analysis of the licenses would be tedious, I will skip it here. Suffice it to say that the RAIL-S and RAIL-M licenses are, in many senses, similar to other public software licenses. A few quick observations will set the table for the rest of my analysis:
- Basic structure: Both licenses borrow from Apache, and follow the basic template of most open licenses—descriptions of (1) what is being licensed, (2) what permissions are being granted to the public, and (3) what restrictions are placed on those grants.
- What is covered: The -S license is drafted to cover source code (such as that used for the wrapper code), while the -M license is drafted to cover the machine learning model itself. Licenses for data are apparently in the works.
- What is left out? While it borrows some language from the Apache license, the -S license does not grant a patent license. In a space where patents are being filed quickly, this is deeply problematic. (The -M license does grant a patent license, though I think it’s debatable whether models themselves are patentable.)
- Where are the “responsible” components? The “R” in RAIL stands for “responsible”, so the licenses contain language defining “responsible” usage of the code. Importantly, these terms can be modified by projects. RAIL-M in particular encourages this by placing the responsible-use constraints in an Appendix rather than the body of the license. As a result, it’s less accurate to say “the RAIL-M license” and more accurate to speak of specific versions of the RAIL-M license, like BigScience or CreativeML.
Runner jumping a hurdle, generated by Stable Diffusion.
This license offers some genuinely new and interesting challenges, which are worth calling out. While these may come across as critiques, I think it’s worth stressing that these are hard challenges in a new technical-legal area, and it would be surprising if all the problems were solved this early—especially since the drafters cannot rely on legislation or caselaw to help them define and refine their work.
- Can a model even be licensed? Because it is early days, it is unclear whether copyright applies to a trained model. There is some creativity in choosing parameters for training a model, but the actual output is an n-dimensional vector, incomprehensible to human minds. Given this, the model is in many ways much closer to data than creative expression—and so may not be protectable by copyright. To put it another way: if someone came after me for violating this license on a model, my first defense would probably be that I don’t need a license at all. It’s unclear how to work around this; some data licenses have tried but they are arguably more binding in spirit than in the letter of the law.
- Binding all parties equally? When many developers work together and all become co-authors of a copyleft codebase, that makes it very hard to change the license. (Mozilla did this, but it took years, and the consensus is that it would be impossible to do for the Linux kernel.) This sounds like a problem, but can be an important form of protection, since it means that every contributor—big and small alike—must respect the license equally, since none of them could rewrite the code from scratch. However, for an ML model, well-resourced parties who have access to the source code can recreate the model from scratch. It is unclear how a license on a previous version of the model could bind the parties who retrained a new version. This makes it difficult to trust those large parties as an equal partner in a community, since they can reject the license for merely the cost of retraining. (And indeed, while this was being drafted, a small spat occurred about a release of version 1.5 of the stable-diffusion model, where the CIO of stability.ai seemed to discuss a release—without any reference to being bound by the licenses of the previous versions.)
- Model updates? The license says that the party that created the model “reserves the right to… update the Model through electronic means”, and that the user will “undertake reasonable efforts to use the latest version of the model”. This is in some ways the most radical clause in the license, and is challenging in two key ways. First, it grants control over running systems to a third party, which for many commercial entities would be even more objectionable than the ethical obligations. Second, it does not explain how conflicts between a modified downstream model and an updated upstream model should be resolved—for example, if I add functionality X, and the updated model intentionally blocks functionality X, the license leaves unspecified how the two parties are to resolve that conflict.
- Complementarity to other governance tools? As noted above, the RAIL authors understand the licenses to be part of a suite of practice and tools for ethical action in machine learning, not a standalone entity. How the license interacts with those other components (like nation-state regulation, model cards, etc.) is still underdefined. This is an area ripe for exciting innovation, but also ripe for drafting mistakes and failed predictions—I look forward to seeing how this evolves.
None of these factors are, in and of themselves, features that should block a company or potential contributor from participating in a RAIL-M licensed community. But they should at least give pause, and we should see how they play out over time.
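The “closer to data than creative expression” point above can be made concrete with a toy example. (This is an illustrative sketch only, not a claim about how real models are trained or stored.) After “training”, the entire artifact is just numbers:

```python
import random

# Toy "training": learn y = 2x by nudging a single weight with gradient
# steps. Illustrative only -- real models do the same thing with billions
# of weights instead of one.
random.seed(0)
w = random.random()
for _ in range(1000):
    x = random.random()
    error = w * x - 2 * x       # distance from the target function
    w -= 0.1 * error * x        # gradient descent step

# The finished "model" is nothing but this float (roughly 2.0): no prose,
# no human-readable expression, just a learned number.
print(round(w, 3))
```

Scale this up and the result is the n-dimensional vector of weights described above: the training *code* is clearly creative expression, but the weights it emits look a lot more like data.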
Given the challenges inherent in any innovation, especially legal innovation, I offer these in the spirit of collaborative discussion rather than destructive critique. (Readers may want to compare and contrast with my comments on the earlier Hippocratic ethical license, as well.)
- Which ethical obligations? As already noted, the RAIL-M license has an appendix listing the prohibited uses, allowing different communities to choose different prohibitions. This is in some ways sensible (presumably some communities will have very specific concerns) but I suspect will create some interoperability problems—in essence, is each separate version of the appendix a different license? Can the models be mixed or chained together if the obligations are different? My knee-jerk reaction is that this will lead to conflicts and inconsistencies, but it’s also quite possible that this will lead to experimentation and improvement at a faster rate. (Beware premature standardization!)
- Who will obey the ethical obligations? Copyright licenses aren’t much good against criminals, because they are already ignoring many rules with more serious penalties. Instead, licenses are most useful when trying to stop large, well-lawyered corporations. If they use the software at all, those corporations will try to find loopholes—but they’ll also at least try to present the appearance of compliance. So, for example, the RAIL-M model license’s prohibition on providing medical advice is likely to have some impact, because hospitals, medical device providers, and national health care systems have large compliance teams. The prohibition on harassment is, in contrast, likely to be completely ineffective, because serial harassers are not the type who read license agreements. (Nor are software authors well-positioned to enforce the license against serial harassers.) The drafters may want to consider drafting differently for those two distinct threat models, perhaps by adopting the third-party enforcement provisions pioneered by the Cryptographic Autonomy License, or by accepting that (once a model is released) criminal law is likely a much better route for enforcement against criminals than copyright law. They might also want to consider requiring transparency rather than specific actions, with the goal of helping “real world” regulators understand and regulate—rather than regulating directly through the license.
- What do the ethical obligations mean? There is a reason why criminal and product liability laws (both of which are implicated by the RAIL-M template appendix) are typically hundreds of pages long—before considering the thousands of pages of caselaw that help us interpret those laws. These concepts can’t be shrunk into a single page without losing a lot of fidelity and accuracy, potentially making this “diet” version of them both over- and under-broad—a problem for enforcement and for community adoption.
- Terminology and boundaries? The license goes to a great deal of trouble to tailor itself to the many new terms in this new space. (It made me finally learn what distillation is!) Similarly, it tries hard to distinguish which terms apply to the model, to outputs, and to various other related materials. This shows a good eye for detail (which is critical in a new space), but it also worries me that the license will age poorly, as the GPL has because of its reliance on C-specific terminology and technologies (like linking).
- Passing through to users: The license says all your users must also comply with the license. This is reasonable, but given the many different combinations of user interfaces and legal regimes that may apply to them, this is a tricky clause to get right in an enforceable way. The RAIL team provides model language for this purpose, but I have not reviewed it yet.
- Indemnification: Indemnity clauses are often an afterthought in open source licenses, but that’s in large part because there is very little liability for software. If the EU follows through on plans to create liability for AI and other classes of software, indemnities may become much more important. I would strongly encourage the authors to study the indemnity language of the Eclipse License, which attempts to protect volunteer creators from liability.
Is this open or not?
Neon sign for open, generated by Stable Diffusion. (It’s not very good at letters yet, which … actually felt appropriate for the post!)
Put simply: this family of licenses does not comply with the Open Source Initiative’s Open Source Definition, and any company that requires absolute freedom in how it runs its infrastructure will likely reject it.
That said, in several colloquial senses it is open: it tries to encourage collaboration; it tries to hew to ethical boundaries (even if different boundaries than those historically associated with Free and Open movements); and it releases tools that can be widely (if not universally) used and modified.
In addition, even if it isn’t “open”, it may follow the path of Linux and the early GPL-licensed databases: execs may reject it while individual contributors use it anyway. Given this support (both by users and the AI developer community), traditional open communities should be figuring out how to build bridges and share knowledge with these new communities. This is starting in places, with both the Open Source Initiative and Creative Commons looking at ML issues. I plan to give time to this important effort as well, and I hope, if you have relevant skills, you will too.