Report to Nature about Research Misconduct via
Post-Selections Using Test Sets

Juyang Weng1,2,3,4

1Department of Computer Science and Engineering

2Cognitive Science Program

3Neuroscience Program

Michigan State University, East Lansing, MI, 48824 USA

4GENISAMA LLC, 4460 Alderwood Drive, Okemos, Michigan 48864 USA

Submitted: June 4, 2021

Revised: June 25, 2021


Dear Dr. Helen Pearson, EIC of Nature:

All the respondents below are my friends and I look forward to collaborating with them. However, as a scientist, I must uphold research ethics and fulfill my duty to report research misconduct.

As a COPE member, Nature has its duty: “Journals should: (1) acknowledge receipt of communications from institutions and respond promptly to findings of research misconduct; (2) inform institutions about possible misconduct and provide evidence to support these concerns; (3) investigate allegations of misconduct by researchers acting as peer reviewers for the journal, follow the COPE flowchart on such cases, and liaise with the institution as required; (4) follow the COPE guidelines on retractions.”

I hereby respectfully allege that the following papers published in Nature [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] have grossly violated research ethics. They appear to have used various forms of Post-Selection Using Test Sets (PSUTS), as reported in Weng 2021 [16]. Post-Selection is an unethical practice typically used in neural-network experiments, and it grossly violates experimental protocols well established in statistics and pattern recognition.

First, the authors used test sets like training sets. Their practice is like leaking all test problems to billions of machine students who are not only randomly initialized but also greedy in adaptation. The authors examine the individual luck of each student on the leaked test set and then pick the luckiest student to report in the paper, without reporting any of the remaining, less lucky students.
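To make the alleged practice concrete, the following is a minimal, self-contained sketch in Python of what Post-Selection Using Test Sets amounts to. It uses toy data and a small scikit-learn network; it is not the code of any cited paper, and the number of students and network size are illustrative placeholders. Every randomly initialized “student” sees the same test set, and only the luckiest score is reported.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy data standing in for any benchmark with a fixed test set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

test_accuracies = []
for seed in range(50):  # 50 randomly initialized "students"
    student = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300,
                            random_state=seed)
    student.fit(X_train, y_train)                # greedy (gradient-based) adaptation
    # The same test set is "leaked" to every student during selection:
    test_accuracies.append(student.score(X_test, y_test))

# Post-Selection Using the Test Set: only the luckiest student is reported.
print("reported (luckiest):", max(test_accuracies))
print("unreported average :", float(np.mean(test_accuracies)))
```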

Second, they did not report the Post-Selection stage, hiding this unethical practice. A rare exception is [4], which vaguely mentioned 20 students in the Post-Selection stage but did not report the performances of all 20. Moreover, the count of 20 in [4] apparently did not include the many more random students embedded in a brute-force search over hyper-parameters, amounting to roughly billions of hyper-parameter vectors, as part of the Post-Selection stage. This means that roughly 20 billion random-and-greedy students tried their luck on the leaked test sets. Although [4] is insufficiently transparent, it at least admitted the existence of a Post-Selection stage. Therefore, all major data reported in these publications are nontransparent, and likely fraudulent, incorrect, inaccurate and misleading. The “typical” nature of this unethical practice makes this report important and a due correction urgent. Otherwise, damage to the credibility of science will continue and the waste of huge amounts of resources will escalate.

Why is reporting only the luckiest student unethical? The well-known cross-validation procedure in statistics and pattern recognition requires reporting the average performance of all the random students trained. This is because the luckiest student from Post-Selection statistically does no better than the average when it takes an unleaked test set. For example, even on the relatively much simpler MNIST dataset, the error rate of the luckiest student typically shoots up by roughly a factor of ten on an unleaked test set.
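For contrast, here is a minimal sketch of the cross-validation convention referred to above, again on toy data with a small scikit-learn network rather than the setup of any cited paper: what gets reported is the average (and spread) over all folds, never the single luckiest fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0)

# 10-fold cross-validation: each fold's held-out data is unseen during training.
fold_scores = cross_val_score(net, X, y, cv=10)

# The convention is to report the average performance, not the best fold.
print("reported: %.3f +/- %.3f" % (fold_scores.mean(), fold_scores.std()))
print("never reported alone: best fold %.3f" % fold_scores.max())
```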

Some laymen may wonder how SuperVision [1], AlphaGo [3] and the Project Debater [12] took unleaked test sets in open competitions. I respectfully allege that the open contests reported in [1], [3], [12] all have major loopholes for PSUTS. Let us scrutinize each contest one by one.

Paper [1] claimed “almost halving the error rates of the best competing approaches”; however, ImageNet ILSVRC-2012 did not explicitly ban hand-labeling the test set, although such a practice is obviously unethical. Based on the scientific analysis in [16] of the fundamental flaws of CNNs that are greedily trained using error-backprop, I predict that the “halving the error rates” claim in [1] is likely fraudulent, achieved by hand-labeling the test set. For more detail, see [17].

The organizers of the AlphaGo competition did not, intentionally or unintentionally, explicitly ban all human interactions with a competing machine. The public deserves an explanation for the absence of this ban. Thus, a human competitor without any computer support was competing against an unknown number of humans behind the scenes who had extensive computer support. The authors of [3] are expected to explain the PSUTS stage.

The same is true for the Project Debater. At least four human competitors were allowed to interact with the Project Debater during the competition (see the YouTube video). Such interactively participating humans were allowed to conduct on-site human PSUTS with extensive computer support.

Garry Kasparov complained about such human interactions with Deep Blue behind the scenes [18]. IBM probably “improved” by moving the humans from behind the scenes for Deep Blue to the front of the scene for the Project Debater.

Therefore, the data reported for SuperVision [1], AlphaGo [3], the Project Debater [12] and Deep Blue lacked key transparency about the Post-Selection stages. The “outperform” claims all lack solid evidence in the face of the alleged Post-Selection, whether by machine PSUTS or human PSUTS. The authors of [1], [3], [12] are expected to explain their PSUTS stages, backed by independently verifiable data.

Likewise, all papers using Convolutional Neural Networks (CNNs) have a responsibility to explain their Post-Selection stages. The only deep CNN that I am aware of that does not use any Post-Selection is Cresceptron [19], because it generates feature neurons incrementally and fully automatically with the “skull” fully closed (i.e., for all the hidden layers between the image layer and the motor layer).

The PSUTS practice also exists in almost all other machine-learning methods that require initializations of a large number of network parameters, including Hopfield Networks, Markov Models, Graphical Models, adversarial learning, LSTM, and genetic algorithms. The only neural networks that use no error-backprop, no convolution and no Post-Selection, and that are also optimal across the learning “life”, are the category called Developmental Networks (DNs), because they not only generate neurons incrementally and fully automatically but also adapt them optimally, in the sense of maximum likelihood, in closed form. Explained in layman’s terms, at each frame time a DN computes, via maximum likelihood, the best internal representation in probability distribution without iterations, instead of iteratively minimizing an ad hoc error measure at the motor outputs.

Unlike traditional deep learning networks, a DN is not just a data fitter from images to motors; it also learns an emergent universal Turing machine, becoming the first network capable of Autonomous Programming for General Purposes (APFGP) [20], general machine thinking [21] and conscious machine learning [22]. Thus, AI’s future is bright, not tarnished by the typical misconduct in AI reported here.

Last, paper [23] should be withdrawn, since it lacks even minimal transparency. The paper stated: “We used the inverse probability of treatment weighting to adjust for baseline confounding factors and to emulate randomization.” However, it did not state (1) what AI method was used to estimate the inverse probability of treatment weighting (IPTW), (2) what feature representations the IPTW was applied to in the “Flatiron Health database”, and (3) how the uncertainty in the real data underlying the IPTW estimates was cross-validated to support “relaxing specific eligibility criteria”. IPTW is a machine-learning method that relies on probability estimates, and is therefore subject to the above Post-Selection allegation; the authors employed it to deal with the well-recognized uncertainty in real human data. For minimal transparency, the authors should explain how they cross-validated the conclusion of “relaxing specific eligibility criteria”, obtained with the IPTW method, on new human data with uncertainty. Therefore, in its present form, all major data (Table 1 and Fig. 2) reported by paper [23] are unfounded and unjustified, if not yet found to be fraudulent, incorrect, misleading and inaccurate, because of (a) the lack of minimal transparency about the claimed AI method and (b) the possible Post-Selections (PSUTS) underlying the IPTW.
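To make the transparency questions concrete, the following is a generic, minimal sketch of IPTW on toy data. It is emphatically not the pipeline of paper [23]; the estimator of the propensity score, the feature representation and the validation procedure are exactly what the letter asks the authors to disclose, so a plain logistic regression stands in here only as a labeled placeholder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # toy baseline covariates ([23]'s features are undisclosed)
p_treat = 1.0 / (1.0 + np.exp(-X[:, 0]))
treated = rng.binomial(1, p_treat)        # treatment assignment depends on covariates

# Step 1: estimate the propensity score P(treated | covariates).
# Paper [23] does not say which estimator was used; logistic regression is a placeholder.
propensity = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Step 2: inverse-probability weights, intended to emulate randomization by
# weighting each patient by 1 / P(received own treatment | covariates).
weights = treated / propensity + (1 - treated) / (1 - propensity)

# Step 3: any downstream "emulated trial" estimate inherits the uncertainty of
# these estimated probabilities -- the uncertainty the letter asks to see
# cross-validated on new data.
print("weight range:", weights.min(), weights.max())
```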

Redress sought: request that all the papers [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] provide and publish a detailed explanation of their Post-Selection stages. Specifically, they should state how many systems were trained by the reported methods (including random weights and hyper-parameters), report the average, minimum and maximum performances of all trained systems in the Post-Selection stage, and explain why the average performances were not reported in the original papers. If an originally reported performance is considerably better than the average performance of all trained systems, retract the corresponding paper for violating the well-accepted principle of cross-validation (which requires reporting the average performance instead). In addition, request [1], [3], [5] to report details of human interactions during the contests. Request that paper [23] explain what AI method was used and describe its Post-Selection stage, if machine learning is involved.

If minimally acceptable responses are not received from the authors within a specified time window, Nature should declare the corresponding papers withdrawn for a lack of acceptable transparency.

This subject is central to public trust in AI science, AI ethics and AI hype. Taxpayers worldwide have made huge investments in AI research at academic institutions; the public has been greatly affected by its huge investments in publicly traded companies. The authors of [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [23] have declared affiliations with such institutions or companies. Please keep this discussion fully transparent to the public, which has the right to know about these AI discipline-wide problems. The cases reported here are no longer like the 96 isolated incidents retracted from Science in the past 35 years [24] or isolated cases elsewhere [25]. These lack-of-transparency problems and the attendant ethical problems appear to be typical and continual in AI-related papers published in Nature since 2015.


REFERENCES

  1. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

  2. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).

  3. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

  4. Graves, A. et al. Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016).

  5. Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).

  6. McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).

  7. Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).

  8. Bellemare, M. G. et al. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature 588, 77–82 (2020).

  9. Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O. & Clune, J. First return, then explore. Nature 590, 580–586 (2021).

  10. Saggio, V. et al. Experimental quantum speed-up in reinforcement learning agents. Nature 591, 229–233 (2021).

  11. Willett, F. R., Avansino, D. T., Hochberg, L. R., Henderson, J. M. & Shenoy, K. V. High-performance brain-to-text communication via handwriting. Nature 593, 249–254 (2021).

  12. Slonim, N. et al. An autonomous debating system. Nature 591, 379–384 (2021).

  13. Mirhoseini, A. et al. A graph placement methodology for fast chip design. Nature 594, 207–212 (2021).

  14. Lu, M. Y. et al. AI-based pathology predicts origins for cancers of unknown primary. Nature 594, 106–110 (2021).

  15. Warnat-Herresthal, S. et al. Swarm learning for decentralized and confidential clinical machine learning. Nature 594, 265–270 (2021).

  16. Weng, J. On post selections using test sets (PSUTS) in AI. In Proc. International Joint Conference on Neural Networks, 1–8 (Shenzhen, China, 2021).

  17. Weng, J. Did Turing Awards go to fraud? YouTube Video (2020). 1:04 hours, https://youtu.be/Rz6CFlKrx2k.

  18. Silver, A. Deep Blue’s cheating move. Chess News (2015). https://en.chessbase.com/post/deep-blue-s-cheating-move.

  19. Weng, J., Ahuja, N. & Huang, T. S. Learning recognition and segmentation using the Cresceptron. International Journal of Computer Vision 25, 109–143 (1997).

  20. Weng, J. Autonomous programming for general purposes: Theory. International Journal of Humanoid Robotics 17, 1–36 (2020).

  21. Wu, X. & Weng, J. On machine thinking. In Proc. International Joint Conf. Neural Networks, 1–8 (IEEE Press, Shenzhen, China, 2021).

  22. Weng, J. Conscious intelligence requires developmental autonomous programming for general purposes. In Proc. IEEE International Conference on Development and Learning and Epigenetic Robotics, 1–7 (Valparaiso, Chile, 2020).

  23. Liu, R. et al. Evaluating eligibility criteria of oncology trials using real-world data and AI. Nature 592, 629–633 (2021).

  24. Anderson, L. E. & Wray, K. B. Detecting errors that result in retractions. Social Studies of Science 49, 942–954 (2019).

  25. Steen, R. G. Retractions in the scientific literature: is the incidence of research fraud increasing? Journal of Medical Ethics 37, 249–253 (2011).